ABSTRACT


A procedure is described for determining a decision rule for the
one-dimensional, two class recognition problem with unknown,
nonparametric, class-conditional density functions. A priori class
probabilities are known, and the densities are assumed to satisfy
Lipschitz conditions with known Lipschitz constant. The procedure is
essentially a histogram approach where the partition for the histogram
is changed as directed by a performance measure. It is desirable to
minimize the difference between the probability of a recognition error
when using the decision rule and the minimum attainable probability of
recognition error. For a fixed partition, conditions are stated that
assure achievement of a specified confidence that this difference is
below a specified constant. The variable partition procedure operates
with limited storage and allows, but does not assure, attainment of
the specified confidence. Computer simulated results are given that
experimentally illustrate attainment of the desired confidence for
the problems considered. A technique is suggested for extending the
procedure to multidimensions. This technique converts the
multidimensional problem to a one-dimensional problem. It operates by
mapping sets in a multidimensional domain one-to-one onto sets in a
one-dimensional domain. Computer simulated results are presented.


CHAPTER  I 


INTRODUCTION 


1.1  The  Problem 

One of the problems in computerized recognition is that of
assigning a vector observation to one of several classes. Applications
include the recognition of properties of waveforms or pictures
which are represented by vectors. The total recognition problem
should include the following operations:

A) Select sensors for the problem and represent the sensor outputs
for each waveform, picture, etc. by an $\ell'$-dimensional vector. This
operation involves expert problem knowledge.

B) Represent the $\ell'$-dimensional vector with a vector in an
$\ell$-dimensional space ($\ell < \ell'$) called the observation space and
denoted $V^{\ell}$. This is accomplished using a data-dependent,
dimensionality reducing mapping. Denote with $x$ an $\ell$-dimensional
vector in $V^{\ell}$.

C) Recognize the $\ell$-dimensional vector by assigning it to one of
several classes using a classification procedure conditioned on
previously processed observation vectors called training observations.

Examples  of  applications  include  automatic  sonar  and  radar  detec- 
tion and  classification,  medical  diagnosis  including  electrocardiograms 
and  electroencephalograms,  aerial  photography  processing  for  earth 
resource  studies,  and  quality  control. 


This report is concerned with C); thus, the problem begins with
a set $Y_n$ of $n$, $\ell$-dimensional vector training observations (often
called patterns [1]) in an $\ell$-dimensional observation space denoted
$V^{\ell}$. Using the set $Y_n$ together with available a priori knowledge,
an observation $x$ is assigned by the classification procedure to one of
several classes.

The  assumptions  and  constraints  specifying  the  particular  classi- 
fication problem  considered  in  this  report  are: 

1) A vector observation $x$ is to be assigned to one of two classes,
denoted $\omega_1$ and $\omega_2$. The a priori probabilities $P_1$ and $P_2$
that $x$ belongs respectively to $\omega_1$ or $\omega_2$ are known.

2) The observation space is 1-dimensional (Chapter V describes a
method for extending the results to $\ell$ dimensions).

3) The training observations are supervised; that is, the correct
classification of each observation in $Y_n$ is known. The number of
training observations belonging to $\omega_j$ is $n_j$ with $n_1 + n_2 = n$.

4) Training observations belonging to $\omega_j$ are each independently
and identically distributed according to an unknown class-conditional
density function $f_j$ defined over the observation space. Class $\omega_1$
observations are independent of class $\omega_2$ observations. $f_j$ is not
assumed to be parametric; it cannot necessarily be characterized by
a finite number of parameters. It is assumed that $f_j(x)$ is zero for
$x$ outside a known bounded domain $\Omega$. Without loss of generality

$$\Omega = \{x : 0 \le x \le 1\} \tag{1.1}$$

In addition, it is assumed that $f_j$ satisfies a Lipschitz condition

$$|f_j(x) - f_j(y)| \le L_j |x - y|, \quad x, y \in \Omega \tag{1.2}$$

with Lipschitz constant $L_j$ known a priori.

5)  The  amount  of  computer  storage  is  limited. 

The result of training (processing the training observations)
is the specification of a decision rule $d$ defined on $\Omega$ and taking
on the values 1 or 2. The rule $d$ divides $\Omega$ into two sets identified
by their assigned classes. An observation at $x$ is assigned to class
$\omega_{d(x)}$. The choice of $d$ minimizing the probability of a
classification error is a minimum risk procedure. Details of such
procedures are found in references [2,3,20]. The probability
$\Pr(\varepsilon \mid d)$ of classification error when using $d$ is given by

$$\Pr(\varepsilon \mid d) = \int_{\Omega} P_{\bar{d}(x)}\, f_{\bar{d}(x)}(x)\, dx \tag{1.3}$$

where $\bar{d}(x)$ identifies the class not assigned by $d$ to an observation
at $x$. An optimum decision rule $d_o$ is defined as one that minimizes
$\Pr(\varepsilon \mid d)$. If $j$ is considered to be the argument of
$P_j f_j(x)$, then $d_o(x)$ is given for each $x$ in $\Omega$ by

$$d_o(x) = \operatorname{Arg} \max_j \, [P_j f_j(x)] \tag{1.4}$$

The corresponding minimum probability of error is

$$\Pr(\varepsilon \mid d_o) = \int_{\Omega} \min_j \, [P_j f_j(x)]\, dx \tag{1.5}$$

Figure 1a illustrates a decision rule $d$ for a particular example.
The cross-hatched area represents the corresponding probability of
error, $\Pr(\varepsilon \mid d)$. Similarly Figure 1b illustrates $d_o$ and
$\Pr(\varepsilon \mid d_o)$. Figure 1c illustrates the excess of
$\Pr(\varepsilon \mid d)$ over $\Pr(\varepsilon \mid d_o)$.
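
The quantities in (1.3) through (1.5) can be checked numerically on a small
example. The sketch below is only an illustration under stated assumptions:
the two densities, the equal priors, the midpoint grid, and all function names
are hypothetical choices made here, not taken from the report. Classes 1 and 2
are stored as indices 0 and 1.

```python
import numpy as np

def weighted_densities(priors, densities, grid):
    """Rows are P_j * f_j(x) evaluated on the grid, one row per class."""
    return np.array([p * f(grid) for p, f in zip(priors, densities)])

def optimum_rule(priors, densities, grid):
    """Rule (1.4): pick the class index (0 or 1) with the larger P_j f_j(x)."""
    return np.argmax(weighted_densities(priors, densities, grid), axis=0)

def error_probability(rule, priors, densities, grid):
    """Error (1.3): integrate P_j f_j(x) for the class NOT chosen at x."""
    dx = grid[1] - grid[0]
    w = weighted_densities(priors, densities, grid)
    not_chosen = 1 - rule
    return float(np.sum(w[not_chosen, np.arange(grid.size)]) * dx)

# Hypothetical example on [0, 1]; both densities are Lipschitz with constant 2.
N = 1000
grid = (np.arange(N) + 0.5) / N              # midpoints of N equal cells
priors = (0.5, 0.5)
densities = (lambda x: 2.0 * (1.0 - x), lambda x: 2.0 * x)

d_opt = optimum_rule(priors, densities, grid)
d_other = (grid >= 0.3).astype(int)          # a rule with a misplaced threshold

err_opt = error_probability(d_opt, priors, densities, grid)
err_other = error_probability(d_other, priors, densities, grid)
excess = err_other - err_opt                 # the quantity pictured in Figure 1c
```

For this example the optimum threshold is at x = 0.5, and the excess error of
the rule with its threshold at 0.3 is the area between the two weighted
densities over the interval from 0.3 to 0.5.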

1.2  The  Goal 

An optimum decision rule $d_o$ is defined in terms of the
class-conditional density functions* $f_j$. For the problem being
considered these d.f.'s are unknown; however, they can be estimated and
the estimates $\hat{f}_j$ substituted into (1.4) in place of $f_j$. The
result is a decision rule $d$ that is an estimate for the decision rule
$d_o$. The overall objective is to satisfy

$$\Pr[\Pr(\varepsilon \mid d) - \Pr(\varepsilon \mid d_o) \le \alpha] \ge \beta \tag{1.6}$$

for prespecified constants $\alpha$ and $\beta$ in the interval [0,1]. In words,
the goal is to achieve a specified confidence that the excess of
$\Pr(\varepsilon \mid d)$ over $\Pr(\varepsilon \mid d_o)$ is less than a
specified constant.

1.3  Literature  Survey 

The previously described problem of obtaining a decision rule
constrained by limited storage and with a goal given by (1.6) has
apparently received no previous attention. The closest results are
probably due to Fu and Henrichon [4] who find constants $\alpha'$ and $\beta$
so that

$$\Pr[\Pr(\varepsilon \mid d) \le \alpha'] \ge \beta \tag{1.7}$$

is satisfied. This condition provides a statement about the size of
$\Pr(\varepsilon \mid d)$ whereas condition (1.6) for the current problem is
concerned with the size of $\Pr(\varepsilon \mid d)$ relative to
$\Pr(\varepsilon \mid d_o)$. For the special case when
$\Pr(\varepsilon \mid d_o)$ is known or is known to be negligibly small with
respect to $\alpha$, (1.6) and (1.7) are equivalent. Otherwise (1.6) offers
the advantage of providing information concerning the amount of
improvement in performance obtainable by processing additional
training observations. Fu and Henrichon's procedure operates on all the
training observations essentially simultaneously and requires
increasing computer storage as the number of training observations
increases; thus it is not applicable with the current storage
constraint.

*Hereafter the phrase class-conditional is dropped, and $f_j$ is referred
to as a density function (abbreviated to d.f.).

The missing link in a straightforward application of (1.4), to
obtain $d$, is the method of obtaining the estimate d.f.'s $\hat{f}_j$ from
the $n_j$ class $\omega_j$ training observations. The limited storage
constraint complicates this estimation.

Abramson and Braverman [5], and Keehn [6] consider estimates of
the form

(1.8)

where $\hat{f}_j$ is a member of the family of Gaussian d.f.'s. They estimate
parameters (mean vectors [5], mean vectors and covariance matrix [6])
characterizing $\hat{f}_j$. With $\{\phi_i\}$ a complete orthonormal set for
the unknown $f_j$, Aizerman, Braverman, and Rozonoer [7] obtain estimated
parameters $\hat{c}_{ji}$ in

$$\hat{f}_j = \sum_{i=1}^{R} \hat{c}_{ji}\, \phi_i \tag{1.9}$$

and show that $\hat{f}_j(x)$ converges in probability to $f_j(x)$ for each
$x$ in the domain $\Omega$.* Tsypkin [8] also uses an orthonormal set
$\{\phi_i\}_{i=1}^{R}$ to get an estimate of the form (1.9), but he does not
assume it to be complete for $f_j$. Tsypkin obtains estimates with the
goal to minimize the Integral Square Error (ISE),

$$\mathrm{ISE} = \int_{\Omega} \big(f_j(x) - \hat{f}_j(x)\big)^2\, dx$$

Kashyap and Blaydon [9] assume only that $\{\phi_i\}_{i=1}^{R}$ are linearly
independent functions. With an estimate $\hat{f}_j$ of the form (1.9), they
consider minimizing both the ISE and the Mean Square Error (MSE),

$$\mathrm{MSE} = \int_{\Omega} \big(f_j(x) - \hat{f}_j(x)\big)^2 f_j(x)\, dx$$

For the 1-dimensional case Rosenblatt [10] considers an estimate
of the form (1.9) for $f_j(x)$ where $R = n_j$ and $\phi_{ji}$ is a function
obtained from the ith class $\omega_j$ observation $x_{ji}$.

$$\hat{f}_j = \sum_{i=1}^{n_j} \frac{1}{n_j}\, \phi_{ji} \tag{1.10}$$

Parzen [11] shows that if

$$\phi_{ji}(x) = \frac{1}{h_{n_j}}\, K\!\left(\frac{x - x_{ji}}{h_{n_j}}\right) \tag{1.11}$$

where $\{h_{n_j}\}$ and the function $K$ satisfy certain conditions, then

$$\lim_{n_j \to \infty} E\big[\big(f_j(x) - \hat{f}_j(x)\big)^2\big] = 0$$

for each $x$ in the domain at which $f_j$ is continuous. Because of the
form (1.10), this estimate has increasing complexity as $n_j$ increases.
References [12,13,14,15] also deal with this type of d.f. estimation.

The well known histogram technique for estimating d.f.'s defines
the functions $\{\phi_i\}_{i=1}^{R}$ as the set of indicator functions on the
regions of an R-region partition of $\Omega$. For the ith region,

$$\phi_i(x) = \begin{cases} 1, & x \text{ in the ith region} \\ 0, & \text{otherwise} \end{cases}$$

$\hat{f}_j$ is given by (1.9). This is a special case of the problem considered
in references [7,8,9].

*In this section on d.f. estimation, $\Omega$ can be multidimensional unless
otherwise stated.
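
The storage contrast between the histogram technique and a kernel estimate of
the Rosenblatt and Parzen type can be made concrete with a short sketch. It is
an illustration only: the Gaussian kernel, the window width `h`, the sample
distribution, and the function names are assumptions introduced here and are
not specified in the report.

```python
import numpy as np

def histogram_estimate(samples, edges):
    """Histogram d.f. estimate: one stored coefficient per region."""
    counts, _ = np.histogram(samples, bins=edges)
    widths = np.diff(edges)
    heights = counts / (len(samples) * widths)    # relative frequency / width

    def f_hat(x):
        idx = np.searchsorted(edges, x, side="right") - 1
        idx = np.clip(idx, 0, len(widths) - 1)    # keep x = 1.0 in the last region
        return heights[idx]

    return f_hat, heights

def parzen_estimate(samples, h):
    """Kernel estimate: every training observation must be stored."""
    samples = np.asarray(samples, dtype=float)

    def f_hat(x):
        u = (np.asarray(x, dtype=float)[..., None] - samples) / h
        kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # assumed Gaussian K
        return kernel.mean(axis=-1) / h

    return f_hat

rng = np.random.default_rng(0)
samples = rng.beta(2.0, 5.0, size=500)            # hypothetical data on [0, 1]
edges = np.linspace(0.0, 1.0, 6)                  # R = 5 regions
f_hist, heights = histogram_estimate(samples, edges)
f_parzen = parzen_estimate(samples, h=0.05)
```

Here the histogram keeps R = 5 numbers no matter how many observations are
processed, while the kernel estimate keeps all 500 observations, which is the
growth in complexity noted for the form (1.10).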

The nearest neighbor decision rule, or rather the more general
K-nearest neighbor decision rule [16], assigns an unclassified
observation to the class most heavily represented among its K nearest
training observations. This rule has been shown [17,18] to have
similarities with a decision rule resulting from using density function
estimates in (1.4). It has been shown [16] that the nearest neighbor rule
results in an asymptotic ($n_j \to \infty$) probability of error that is
less than twice the minimum attainable. Processing requires storage
for all training observations; thus these results cannot be used
when one operates with a storage constraint. Hart [19] has suggested
an interesting storage reducing modification of the nearest neighbor
rule which he calls the condensed nearest neighbor (CNN) rule. The
CNN rule discards a set of training observations from the original
set. The discarded set consists of training observations that, if
treated as unclassified observations, are classified correctly by
the nearest neighbor rule when used with the training observations
retained. The storage requirement is reduced, and the criterion for
discarding a training observation is based on the capability of the
retained observations to make decisions. Supporting theory for the
CNN rule has not yet been published.
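
The discard criterion described for the CNN rule can be sketched as follows.
This is a minimal illustration, assuming a simple repeated-pass scheme and
squared Euclidean distance; the processing order, the function names, and the
test data are choices made here, and the report does not give Hart's
procedure in this detail.

```python
import numpy as np

def nearest_neighbor_label(store_x, store_y, x):
    """Label of the retained observation closest to x."""
    distances = np.sum((store_x - x) ** 2, axis=1)
    return store_y[np.argmin(distances)]

def condensed_nearest_neighbor(X, y):
    """Return indices of retained observations.

    An observation is retained only if the observations retained so far
    would misclassify it; passes repeat until a pass retains nothing new.
    """
    keep = [0]                                   # start from the first observation
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            if nearest_neighbor_label(X[keep], y[keep], X[i]) != y[i]:
                keep.append(i)
                changed = True
    return np.array(keep)

# Hypothetical 1-dimensional data with a single class boundary at 0.5.
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = (X[:, 0] > 0.5).astype(int)
keep = condensed_nearest_neighbor(X, y)
```

When the loop stops, every discarded observation is classified correctly by
the nearest neighbor rule applied to the retained set, which is the property
stated in the text.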

The Gaussian assumption in Abramson and Braverman's work is too
restrictive for the problem outlined in Section 1.1. The work of
Aizerman, Braverman, and Rozonoer, Tsypkin, and Kashyap and Blaydon,
along with the histogram approach, is either too restrictive (small R)
or requires too much storage (large R).

Unsupervised estimation [20,21,22,23,24,25] allows the estimate
of (1.9) to be more general by providing a way to estimate parameters
characterizing each $\phi_i$ in $\{\phi_i\}_{i=1}^{R}$ as well as weighting
coefficients.

Another approach [26,27,28] that adapts the $\phi_i$'s to the data is
based on distribution free tolerance regions. Instead of defining a
partition of $\Omega$ beforehand as in the histogram approach, a procedure
is given for defining the partition in terms of the training
observations; then the distribution free techniques described in
references [2,29] can be used.

Sebestyen [30,31] considers a method that is similar to the
Parzen technique but uses limited storage. Training observations in
close proximity with one another in $\Omega$ are lumped into an average
observation. Sebestyen's estimate d.f. is in the form (1.9) where
$\phi_i$ is a Gaussian d.f. having mean at the ith average observation
and variance related to the size of the region in which observations
contribute to the average. The coefficient of $\phi_i$ is the relative
frequency of observations in the region. The procedure does not have
the properties that Parzen used in his convergence proof.

Specht [32] reduces the storage required in a utilization of the
Parzen approach by expanding estimates in the form (1.10) into a
Taylor series about a selected point in $\Omega$ and then retaining only
the low order terms. The resulting truncated Taylor series is
accurate only near the point of expansion. To obtain accuracy over
the whole domain, the expansion should be carried out at each of
sufficiently many points in the domain. A different set of
coefficients must be stored for each expansion; thus the storage required
would increase in proportion to the number of expansion points used.

When estimating d.f.'s, one must use care to choose a suitable
estimation criterion. This is especially true if one is faced with
the problem of estimating while being constrained with limited storage.
If the d.f.'s cannot be characterized by a number of parameters that
will fit into the limited storage, then some information must be
discarded. In this case, the criterion should not require accurate
estimation where it is not needed because such accuracy is obtained
at the expense of accuracy where it is needed. When the goal of
the estimation is for the estimate d.f.'s to make good decisions if
substituted for the actual d.f.'s in (1.4), it is reasonable that
some measure of the quality of these decisions should be used as the
estimation criterion. Since (1.4) involves a d.f. for each class,
the estimation of one function should involve interaction with the
estimation of the other function. With the exception of the K-nearest
neighbor rule, the above d.f. estimation procedures do not have this
property.

For  other  work  related  to  computerized  recognition,  the  reader 
is  referred  to  the  survey  articles  by  Nagy  [33],  and  Ho  and  Agrawala 
[34]  which  contain  extensive  lists  of  references. 

1.4  The  Approach 

The d.f. estimation used in this report is essentially a histogram
approach but with the partition periodically adapted to improve
a measure of performance. Enough storage is assumed available to
handle parameters associated with each interval in the partition.
The supposition is that a number R of intervals too restrictive in
the ordinary histogram approach may be adequate with the adaptive
capability. This idea is suggested by the fact that an R-interval
histogram d.f. estimation procedure is capable of giving an optimum
decision rule provided the problem has fewer than R decision thresholds.
Figure 2 illustrates a one threshold example optimally solved with a
two interval histogram estimation of $f_1$ and $f_2$.

Figure 2. Two Interval Histogram Estimation Giving an Optimum Decision Rule.

The framework or model within which the classification procedure
operates is now described. Consider a partition $I$ of the domain $\Omega$
into R intervals. Label these intervals $I(1), I(2), \ldots, I(R)$ and
the interval widths $W(1), W(2), \ldots, W(R)$. Define the probabilities
$P_j^*(1), P_j^*(2), \ldots, P_j^*(R)$, $j = 1,2$, by

$$P_j^*(i) = \int_{I(i)} f_j(x)\, dx, \quad i = 1, \ldots, R, \quad j = 1,2 \tag{1.12}$$

Although the $P^*$'s are unknown, any a priori knowledge concerning
them is represented by the notation $A$. The set consisting of the
first $n$ training observations is denoted $Y_n$. Through the use of $A$
and $Y_n$, the classification procedure obtains estimates $\hat{f}_j$
conditioned on $I$, $A$, and $Y_n$. In the remainder of this report, the
partition $I$, the a priori knowledge $A$, and the training observations
$Y_n$ will be understood from the text and are omitted from the notation.

Given a partition, the estimate for $f_j$ is

$$\hat{f}_j = \sum_{i=1}^{R} \frac{\hat{P}_j(i)}{W(i)}\, \phi_i, \quad j = 1,2 \tag{1.13}$$

where $\phi_i$ is the indicator function for the ith interval. $\hat{P}_j(i)$
is the expected value of a distribution on $P_j(i)$, which is a random
variable describing the current uncertainty of $P_j^*(i)$. The a priori
knowledge $A$, or in its absence the first few training observations,
is used to assign this distribution initially. It is updated with
subsequent training observations through the use of Bayes Rule. By
adjusting the variance of the initial distribution, its effect on
the result can be made large or small as desired.

When the resulting estimates are used in place of $f_1$ and $f_2$ to
obtain decision rule $d$, there can be no finer resolution of decision
thresholds than the boundaries of the intervals comprising the
partition. For this reason the capability of altering the partition is
included in the model.

If  the  number  R of  intervals  in  the  partition  is  greater  than  or 
equal  to  the  number  of  decision  thresholds  plus  one,  then  the  model 
is  capable  of  giving  an  optimum  decision  rule.  An  optimum  decision 
rule  is  attained  when  all  thresholds  coincide  with  interval  boundaries 
and  when  each  interval  is  classified  correctly  through  use  of  the 
estimate  functions. 
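
The point that decision thresholds can only fall on interval boundaries can
be seen in a small sketch. It assumes a two interval partition of [0, 1],
equal priors, and interval probabilities taken from the hypothetical densities
2(1 - x) and 2x; the function names and all numerical values are illustrative
and not taken from the report.

```python
import numpy as np

def interval_rule(edges, p_hat, priors):
    """Class (1 or 2) assigned to each interval from P_j * P_hat_j(i) / W(i)."""
    widths = np.diff(edges)
    scores = np.asarray(priors)[:, None] * np.asarray(p_hat) / widths
    return np.argmax(scores, axis=0) + 1

def decision_thresholds(edges, rule):
    """Interval boundaries at which the assigned class changes."""
    change = np.nonzero(np.diff(rule))[0]
    return np.asarray(edges)[change + 1]

edges = np.array([0.0, 0.5, 1.0])          # R = 2 intervals
p_hat = np.array([[0.75, 0.25],            # class 1 probability per interval
                  [0.25, 0.75]])           # class 2 probability per interval
rule = interval_rule(edges, p_hat, priors=(0.5, 0.5))
cuts = decision_thresholds(edges, rule)
```

With the boundary at 0.5 the single threshold of this example coincides with
an interval boundary, so two intervals are enough for an optimum rule. Had the
boundary been placed elsewhere, no choice of interval probabilities could move
the threshold off that boundary.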

A general description of the approach used to satisfy condition
(1.6) is now presented. The discussion follows the system flow
diagram† of Figure 3.

†This flow diagram corresponds to an actual implementation, the results
of which are presented in Chapter IV.

a) Initialization

Initially, a partition and a distribution on each $P_j(i)$ are assigned.
This assignment is based on a priori knowledge about $P_j^*(i)$.

b) Updating

A set of supervised training observations is used to update
the distribution on each $P_j(i)$ through use of Bayes Rule.

c) Classification

The constants $\alpha$ and $1 - \beta$ are allocated to the intervals and a
set of R conditions, one for each interval, similar to condition (1.6)
for the whole domain, is developed. These interval conditions taken
together are sufficient for (1.6). The interval condition for each
interval is checked independently of the others. A record is made
of any interval whose interval condition is satisfied and of the
class assigned to that interval (such an interval is said to be
classified). If all intervals are classified then processing is
stopped with the statement that condition (1.6) is satisfied.
Otherwise processing continues.

d)  Adjust  the  Partition 

The  classification  rate  of  an  interval  is  defined  as  the  total 
probability  in  the  interval  divided  by  the  number  of  training  obser- 
vations required  to  classify  it.  The  unclassified  intervals  are 
ranked  in  a priority  table  according  to  estimates  of  the  maximum 
possible  classification  rates  for  the  intervals.  The  maximization 
is  with  respect  to  interval  width.  The  partition  is  adjusted  by 
considering  the  intervals  one  at  a time  in  the  order  that  they  appear 
in  the  priority  table.  An  interval  is  either  split  into  two  intervals, 
combined  with  one  of  its  adjacent  intervals,  or  left  unchanged  accord- 
ing to  a rule  based  on  a measure  of  performance  and  the  storage  con- 
straint. After  partition  adjustment,  a priori  knowledge  is  reassigned 
to  the  intervals.  The  process  repeats  as  often  as  is  necessary  according 
to  the  flow  diagram  of  Figure  3. 


1.5 Report Organization

Chapter  II  contains  the  details  of  the  approach  for  a fixed 
partition.  The  initialization  and  updating  of  the  distributions  on 
the  Pj(i)'s  is  discussed.  The  use  of  these  distributions  to  obtain 
estimate  d.f.'s  and  the  subsequent  use  of  the  estimates  to  obtain  a 
decision  rule  is  described.  Next,  a set  of  interval  conditions  that 
is sufficient for condition (1.6) is derived.

Chapter  III  describes  an  ad  hoc  approach  for  altering  the  parti- 
tion in  order  to  arrive  at  the  goal  with  limited  storage  and  with 
fewer  training  observations. 

Chapter IV contains computer simulated results. Experimental
studies are included on the effects of tradeoffs between $\alpha$ and $\beta$ of
condition (1.6) and the total number of training observations
required for satisfying it. Also studied are the effects of altering
the Lipschitz constants, the number of intervals, and the number of
training observations observed between times of making computations.

Chapter V contains suggestions for extending the approach to
the multidimensional case via a technique that transforms the
multidimensional problem into a 1-dimensional one. Possible uses for the
mapping other than computerized recognition are discussed.

Chapter VI summarizes the results and their possible engineering
application, and suggests ways in which they might be improved
and extended.


CHAPTER II

SOLUTION  FOR  A FIXED  PARTITION 


2.1  Introduction 

This  chapter  contains  a description  of  the  technique  employed 
for  a fixed  partition  I of  the  domain.  Estimation  of  density  func- 
tions and  a decision  rule  are  discussed.  The  difference  between  the 
probability  of  error  using  the  estimated  d.f.'s  and  that  using  the 
actual  d.f.'s  is  expanded  into  a sum  of  difference  probabilities 
where  each  difference  probability  corresponds  to  an  interval  in  the 
partition.  The  objective  is  to  achieve  a specified  confidence  that 
the  sum  is  less  than  a specified  constant.  A method  that  operates 
by  considering  each  interval  independently  is  developed  for  checking 
whether  the  confidence  is  attained. 

2.2  Density  Function  Estimates. 

A piecewise constant estimate of the jth class-conditional d.f.
is

$$\hat{f}_j(x) = \sum_{i=1}^{R} \frac{P_j(i)}{W(i)}\, \phi_i(x) \tag{2.1}$$
The random vector† $P_j = (P_j(1), P_j(2), \ldots, P_j(R))^t$ has the
R-1 variate Dirichlet density function

$$\Gamma\!\Big(\sum_{i=1}^{R} m_j(i)\Big) \prod_{i=1}^{R} \frac{P_j(i)^{\,m_j(i)-1}}{\Gamma(m_j(i))}, \quad 0 \le P_j(i) \le 1, \quad \sum_{i=1}^{R} P_j(i) = 1 \tag{2.2}$$

and zero otherwise, assuming an a priori Dirichlet density function on
$P_j$ and subsequent training observations, where
$m_j = (m_j(1), m_j(2), \ldots, m_j(R))^t$. Each $m_j(i)$ is obtained from
training observations and a priori knowledge about $P_j(i)$ [35].
$P_j(i)$ has the beta (univariate Dirichlet) density function,

$$\beta\big(P_j(i) \mid \gamma_{j1}(i), \gamma_{j2}(i)\big) = \frac{\Gamma(\gamma_{j1}(i) + \gamma_{j2}(i))}{\Gamma(\gamma_{j1}(i))\,\Gamma(\gamma_{j2}(i))}\, P_j(i)^{\,\gamma_{j1}(i)-1} \big(1 - P_j(i)\big)^{\gamma_{j2}(i)-1}, \quad 0 \le P_j(i) \le 1 \tag{2.3}$$

and zero otherwise, where

$$\gamma_{j1}(i) = m_j(i), \qquad \gamma_{j2}(i) = \sum_{k \ne i} m_j(k) \tag{2.4}$$

The mean and variance of $P_j(i)$ are

$$E_j(i) = E\big[P_j(i) \mid \gamma_{j1}(i), \gamma_{j2}(i)\big] = \frac{\gamma_{j1}(i)}{\gamma_{j1}(i) + \gamma_{j2}(i)}$$

$$\mathrm{Var}_j(i) = E\big[(P_j(i) - E_j(i))^2 \mid \gamma_{j1}(i), \gamma_{j2}(i)\big] = \frac{E_j(i)\,[1 - E_j(i)]}{\gamma_{j1}(i) + \gamma_{j2}(i) + 1} \tag{2.5}$$

†t indicates transpose.


If  Pj(i)  = Ej(i),  then 


The components of $m_j$ may not be consistent with a priori
knowledge of the expected value and variance for each $P_j(i)$. For this
reason and because each interval is to be considered independently,
the Dirichlet d.f. is abandoned in favor of an independent beta d.f.
on each $P_j(i)$. Then, it is consistent to constrain the $\gamma$'s as follows

$$\gamma_{j1}(i) = s_{j1}(i) + v_{j1}(i), \qquad \gamma_{j2}(i) = s_{j2}(i) + v_{j2}(i) \tag{2.6}$$

where the s's and v's account respectively for a priori knowledge and
training observations. The ability to use a priori knowledge is
important for the partition changing technique developed in Chapter III.
An "a priori d.f." on $P_j(i)$ is converted to an "a posteriori d.f." by
using a Bayes iteration:

$$\beta\big(P_j(i) \mid \gamma_{j1}(i), \gamma_{j2}(i)\big) = \frac{\Pr\big(v_{j1}(i), v_{j2}(i) \mid P_j(i), s_{j1}(i), s_{j2}(i)\big)\, \beta\big(P_j(i) \mid s_{j1}(i), s_{j2}(i)\big)}{\int_0^1 \text{Numerator}\; dP_j(i)} \tag{2.7}$$

The iteration includes the information that out of

$$n_j = v_{j1}(i) + v_{j2}(i) \tag{2.8}$$

training observations from the jth class, $v_{j1}(i)$ are in, and $v_{j2}(i)$
are out of the ith interval.

Appendix A considers approaches for specifying the s's that
characterize the a priori d.f. on $P_j(i)$. From (2.6), it is seen
that enough training observations will eventually cause the effects
of the s's to be negligible (provided each interval probability is
greater than zero).
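
The way the prior parameters are outweighed by training counts can be
illustrated with the standard beta mean and variance formulas. The sketch
below assumes the additive update of (2.6) and the moments of (2.5); the
function names and the particular prior and count values are hypothetical.

```python
def beta_parameters(s1, s2, v1, v2):
    """Posterior beta parameters: prior terms plus in/out observation counts."""
    return s1 + v1, s2 + v2

def beta_mean_variance(g1, g2):
    """Mean and variance of a beta d.f. with parameters g1 and g2."""
    mean = g1 / (g1 + g2)
    variance = mean * (1.0 - mean) / (g1 + g2 + 1.0)
    return mean, variance

# Two different hypothetical priors for the same interval probability.
vague_prior = (1.0, 1.0)
strong_prior = (8.0, 2.0)

# Counts: 300 of 1000 class-j training observations fell in the interval.
v_in, v_out = 300, 700

mean_vague, var_vague = beta_mean_variance(*beta_parameters(*vague_prior, v_in, v_out))
mean_strong, var_strong = beta_mean_variance(*beta_parameters(*strong_prior, v_in, v_out))
prior_mean_gap = abs(8.0 / 10.0 - 1.0 / 2.0)      # 0.3 before any observations
posterior_mean_gap = abs(mean_strong - mean_vague)
```

Before any observations the two priors disagree about the mean by 0.3; after
1000 observations the posterior means differ by less than 0.01, and both
variances are small.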


2.3 The Decision Rule

An estimate of the minimum probability of error decision rule is

$$d(x) = \operatorname{Arg} \max_j \, [P_j \hat{f}_j(x)] \tag{2.9}$$

The difference between the probability of error when using $d$ and the
probability of error when using an optimum decision rule $d_o$ is

$$\Pr(\varepsilon \mid d) - \Pr(\varepsilon \mid d_o) = \sum_{i=1}^{R} \big[\Pr(\varepsilon \mid i, d)\Pr(i \mid d) - \Pr(\varepsilon \mid i, d_o)\Pr(i \mid d_o)\big]$$

where $\Pr(\varepsilon \mid i, d)$ is the probability that $d$ errs in classifying
an observation in the ith interval. The probability that an observation
is in the ith interval is $\Pr(i)$ and is independent of $d$. Thus,

$$\Pr(\varepsilon \mid d) - \Pr(\varepsilon \mid d_o) = \sum_{i=1}^{R} Q(i, d, d_o)$$

where

$$Q(i, d, d_o) = \big[\Pr(\varepsilon \mid i, d) - \Pr(\varepsilon \mid i, d_o)\big]\Pr(i)$$

The next section is devoted to obtaining a sufficient condition
for the goal

$$\Pr[\Pr(\varepsilon \mid d) - \Pr(\varepsilon \mid d_o) \le \alpha] \ge \beta$$

Then, computational techniques are developed for checking if this
sufficient condition is satisfied for a given partition.


2.4 A Sufficient Condition

The following proposition gives a set of interval conditions
(one for each interval in the partition) such that satisfaction of
all of them is sufficient for Condition (1.6).

Proposition 1

Given:

a) Constants $\alpha$ and $\beta$ such that

$$0 < \alpha < 1, \qquad 0 < \beta < 1$$

b) Constants $\alpha(i) > 0$ and $\tau(i) > 0$, $i = 1, \ldots, R$, such that

$$\sum_{i=1}^{R} \alpha(i) = \alpha, \qquad \sum_{i=1}^{R} \tau(i) = 1 - \beta$$

Then the set of interval conditions

$$\Pr[Q(i, d, d_o) > \alpha(i)] \le \tau(i), \quad i = 1, \ldots, R \tag{2.10}$$

implies

$$\Pr[\Pr(\varepsilon \mid d) - \Pr(\varepsilon \mid d_o) \le \alpha] \ge \beta$$

Proof

The set of conditions (2.10) implies that

$$\sum_{i=1}^{R} \Pr[Q(i, d, d_o) > \alpha(i)] \le \sum_{i=1}^{R} \tau(i)$$

Since the probability of a union of events does not exceed the sum of
their probabilities, it follows that

$$\Pr\Big[\bigcup_{i=1}^{R} \big(Q(i, d, d_o) > \alpha(i)\big)\Big] \le \sum_{i=1}^{R} \tau(i)$$

From de Morgan's laws

$$\Pr\Big[\bigcup_{i=1}^{R} \big(Q(i, d, d_o) > \alpha(i)\big)\Big] = \Pr\Big[\Big(\bigcap_{i=1}^{R} \big(Q(i, d, d_o) \le \alpha(i)\big)\Big)^{C}\Big]$$

where the superscript "C" indicates complementation.
Then

$$\Pr\Big[\bigcap_{i=1}^{R} \big(Q(i, d, d_o) \le \alpha(i)\big)\Big] \ge 1 - \sum_{i=1}^{R} \tau(i) = \beta$$

which implies that

$$\Pr\Big[\sum_{i=1}^{R} Q(i, d, d_o) \le \sum_{i=1}^{R} \alpha(i)\Big] \ge \beta$$

The conclusion follows.
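
The chain of inequalities in the proof can be checked on simulated values.
The sketch below is an illustration only: the per-interval excess errors are
drawn from an arbitrary hypothetical distribution (a shared term plus an
independent term, so the intervals are deliberately dependent), and the
allocation of the constants to intervals is made up.

```python
import numpy as np

def check_sufficient_condition(Q, alpha_i):
    """Compare the guaranteed confidence with the observed confidence.

    Q has one row per trial and one column per interval.
    Returns (1 - sum of per-interval exceedance rates, rate at which the
    summed excess stays within the summed allocation).
    """
    tau_i = np.mean(Q > alpha_i, axis=0)           # per-interval exceedance rates
    guaranteed = 1.0 - float(np.sum(tau_i))
    observed = float(np.mean(Q.sum(axis=1) <= alpha_i.sum()))
    return guaranteed, observed

rng = np.random.default_rng(1)
trials, R = 20000, 4
alpha_i = np.array([0.01, 0.02, 0.03, 0.04])       # hypothetical allocation of alpha
shared = rng.exponential(scale=0.003, size=(trials, 1))
own = rng.exponential(scale=0.004, size=(trials, R))
Q = shared + own                                   # dependent across intervals

guaranteed, observed = check_sufficient_condition(Q, alpha_i)
```

The observed confidence is never below the guaranteed one, and no
independence between intervals is needed for this, which matches the use of
the union bound in the proof. The gap between the two values indicates how
conservative the interval conditions can be.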
2.5 Classification Procedure for the ith Interval

Consider just the ith interval.

Define

$$U_j(i) = \frac{P_j\, P_j(i)}{W(i)}$$

The d.f. on $U_j(i)$ is

$$\frac{W(i)}{P_j}\, \beta\!\left(\frac{W(i)\, U_j(i)}{P_j} \,\Big|\, \gamma_{j1}(i), \gamma_{j2}(i)\right) \tag{2.11}$$

Define

$$\bar{U}_j(i) = E\, U_j(i) = \frac{P_j}{W(i)} \cdot \frac{\gamma_{j1}(i)}{\gamma_{j1}(i) + \gamma_{j2}(i)}$$

$$\sigma_j^2(i) = \mathrm{Var}\, U_j(i) = \left(\frac{P_j}{W(i)}\right)^{2} \frac{\gamma_{j1}(i)\, \gamma_{j2}(i)}{\big(\gamma_{j1}(i) + \gamma_{j2}(i)\big)^{2} \big(\gamma_{j1}(i) + \gamma_{j2}(i) + 1\big)}$$

$a(i) = \operatorname{Arg} \max_j \, [\bar{U}_j(i)]$ corresponds to the class chosen
by $d$ in the ith interval.

$b(i)$ corresponds to the class not chosen by $d$ in the ith interval.

(2.12)

To avoid redundant notation $a(i)$ and $b(i)$ are denoted $a$ and $b$ when the
ith interval is understood.
The objective is to find a region $V(i)$ in the $(U_a(i), U_b(i))$
plane containing all points for which

$$Q(i, d, d_o) > \alpha(i)$$

is possible. Then the probability $\Pr(V(i))$ is an upper bound for the
probability

$$\Pr[Q(i, d, d_o) > \alpha(i)]$$

$\Pr(V(i))$ is obtained by integrating the d.f.* on $(U_a(i), U_b(i))$ over
points in $V(i)$.

$$\Pr[Q(i, d, d_o) > \alpha(i)] \le \Pr(V(i))$$

The ith interval condition is satisfied if

$$\Pr(V(i)) \le \tau(i) \tag{2.13}$$

*The d.f. on $(U_a(i), U_b(i))$ is the product of d.f.'s defined by (2.11)
on $U_a(i)$ and $U_b(i)$ separately because those d.f.'s are obtained from
independent samples.

The region of integration $V(i)$ is obtained as the intersection
of two regions $\hat{V}_1(i)$ and $\hat{V}_2(i)$, each containing all points in
the $(U_a(i), U_b(i))$ plane for which $(Q(i, d, d_o) > \alpha(i))$ is possible.

Define the events
$$V_1(i) = \big(Q(i, d, d_o) > 0\big)$$

$$V_2(i) = \big(Q(i, d, d_o) > \alpha(i)\big)$$

Because of the possible variation of the density functions $f_1$ and $f_2$
about their averages in the ith interval, it is not possible in
general to specify $V_1(i)$ and $V_2(i)$ as regions in the $(U_a(i), U_b(i))$
plane. However, the following proposition gives regions $\hat{V}_1(i)$ and
$\hat{V}_2(i)$ that contain $V_1(i)$ and $V_2(i)$ respectively. Note that*
$V_2(i) \subset V_1(i)$ and thus $V_2(i) \subset \hat{V}_1(i)$. Careful
examination shows that $\hat{V}_2(i)$ may contain points for which it is
known that $Q(i, d, d_o) = 0$. Elimination of these points from the region
of integration yields a smaller upper bound. Intersection of
$\hat{V}_2(i)$ with $\hat{V}_1(i)$ to obtain $V(i)$ accomplishes the
elimination of these points. Figure 4 illustrates with Venn diagrams the
set relations involved.

The  proof  of  the  proposition  uses  an  easily  proved  statement 
relating  the  range  of  variation  of  a density  function  f in  an  interval 
J to  its  average  over  the  interval  and  an  assumed  Lipschitz  condition. 

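*The notation $V_2(i) \subset V_1(i)$ allows $V_2(i) = V_1(i)$.

Statement

If 1) Interval $J$ has width $W$

2) Density function $f$ satisfies

$$|f(x) - f(y)| \le L\,|x - y|, \qquad x, y \in J$$

3) $\bar f = \dfrac{1}{W} \displaystyle\int_J f(x)\,dx$

then

$$\max_{x \in J} f(x) \le \bar f + \frac{LW}{2} \tag{2.14}$$

and if

$$\bar f < \frac{LW}{2}$$

then

$$\max_{x \in J} f(x) \le (2LW\bar f)^{1/2}, \qquad \min_{x \in J} f(x) \ge 0 \tag{2.15}$$

By the same argument as for (2.14), $\min_{x \in J} f(x) \ge \bar f - LW/2$; combined with $f \ge 0$, this is the lower bound used in the proof of Proposition 2.

Proposition 2†

Given the definitions

$$U_j = p_j \bar f_j, \qquad j = 1,2$$

$$C_j = \frac{p_j L_j W}{2}, \qquad j = 1,2$$

$$\rho = \frac{\alpha(i)}{W}$$

$$\delta = \begin{cases} C_b & \text{if } U_b \ge C_b \\ \big[(4 C_b U_b)^{1/2} - U_b\big] & \text{if } U_b < C_b \end{cases}$$

Then

1) A region $\bar V_1$ containing $V_1$ in the $(U_a, U_b)$ plane is:

$$\bar V_1 = (U_b > U_a - C_a - \delta) \tag{2.16}$$

and

2) A region $\bar V_2$ containing $V_2$ in the $(U_a, U_b)$ plane is:

$$\bar V_2 = \Big(U_b - \min_{j=1,2}\big[\max(0,\, U_j - C_j)\big] > \rho\Big) \tag{2.17}$$

where it is understood that the definition of $\bar V_1$ and $\bar V_2$ includes intersection with

$$\Big(0 < U_a < \frac{p_a}{W}\Big) \cap \Big(0 < U_b < \frac{p_b}{W}\Big).$$

†Because the ith interval is understood, the "i" is dropped from the notation when confusion does not result.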


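Proof

Part 1

$$V_1 = \big(Q(i, d, d_0) > 0\big) = \big(p_a f_a(x) < p_b f_b(x) \text{ for some } x \in J\big) \subset \Big(p_a \min_{x \in J} f_a(x) < p_b \max_{x \in J} f_b(x)\Big)$$

From (2.14) and (2.15)

$$p_a \min_{x \in J} f_a(x) \ge p_a \bar f_a - C_a, \qquad p_b \max_{x \in J} f_b(x) \le p_b \bar f_b + \delta$$

Then

$$V_1 \subset (p_a \bar f_a - C_a < p_b \bar f_b + \delta) = (U_b > U_a - C_a - \delta) = \bar V_1$$

Part 2

$$V_2 = \big(Q(i, d, d_0) > \alpha(i)\big) = \Big(\Pr(i)\big[\Pr(e \mid i, d) - \Pr(e \mid i, d_0)\big] > \alpha(i)\Big)$$

Note that

$$\Pr(e \mid i, d) = \frac{p_b P_b}{\sum_{j=1}^{2} p_j P_j} = \frac{p_b P_b}{\Pr(i)}$$

Note that $\Pr(e \mid i, d_0)$ can be expanded as

$$\Pr(e \mid i, d_0) = \int_J \Pr(e \mid i, x, d_0)\, f(x \mid i)\,dx = \int_J \frac{\min_{j=1,2}\big[p_j f_j(x)\big]}{f(x)}\, f(x \mid i)\,dx$$

and that

$$\min_{j=1,2}\big[p_j f_j(x)\big] \ge \min_{j=1,2}\Big[\min_{x \in J} p_j f_j(x)\Big].$$

By (2.14) and (2.15)

$$\min_{x \in J}\big[p_j f_j(x)\big] \ge \max(0,\, U_j - C_j)$$

such that

$$\Pr(e \mid i, d_0) \ge \min_{j=1,2}\big[\max(0,\, U_j - C_j)\big] \int_J \frac{f(x \mid i)}{f(x)}\,dx = \frac{W}{\Pr(i)} \min_{j=1,2}\big[\max(0,\, U_j - C_j)\big]$$

Then

$$\Pr(i)\big[\Pr(e \mid i, d) - \Pr(e \mid i, d_0)\big] \le W\Big(U_b - \min_{j=1,2}\big[\max(0,\, U_j - C_j)\big]\Big)$$

and thus $V_2 \subset \bar V_2$.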

Figure 5 illustrates the region $\bar V_1$ while Figure 6 illustrates two cases that result for $\bar V_2$ depending on the relative sizes of $\rho$ and $C_b$. The region $\bar V$ over which the density function on $(U_a, U_b)$ is to be integrated is obtained as the intersection of $\bar V_1$ and $\bar V_2$.

$$\bar V = (U_b > U_a - C_a - \delta) \cap \Big(U_b - \min_{j=1,2}\big[\max(0,\, U_j - C_j)\big] > \rho\Big) \tag{2.18}$$

[Figure 6. The Event $\bar V_2$; panel (b): $\rho > C_b$]

Figure 7 illustrates $\bar V$ for each of the two cases, $\rho < C_b$ and $\rho > C_b$. Note that if $\rho > C_b$, then $\bar V = \bar V_2$. The transition from $\rho \ge C_b$ to $\rho < C_b$ is not smooth with respect to the region $\bar V$. Figure 8 illustrates the increment $\Delta \bar V$ included into $\bar V$ when $\rho - C_b$ is changed from slightly positive to slightly negative.
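Membership of a point in the region of (2.18) can be checked directly. The sketch below assumes the reading of Proposition 2 in which $\delta = C_b$ for $U_b \ge C_b$ and $\delta = (4C_bU_b)^{1/2} - U_b$ otherwise; the function names are illustrative only, and the bounding box $0 < U_j < p_j/W$ is left to the caller.

```python
import math


def delta(U_b, C_b):
    """Allowance for the peak of p_b * f_b above its interval average U_b."""
    if U_b >= C_b:
        return C_b
    return math.sqrt(4.0 * C_b * U_b) - U_b


def in_V1_bar(U_a, U_b, C_a, C_b):
    """Region (2.16): points where Q > 0 cannot be ruled out."""
    return U_b > U_a - C_a - delta(U_b, C_b)


def in_V2_bar(U_a, U_b, C_a, C_b, rho):
    """Region (2.17): points where Q > alpha(i) cannot be ruled out."""
    slack = min(max(0.0, U_a - C_a), max(0.0, U_b - C_b))
    return U_b - slack > rho


def in_V_bar(U_a, U_b, C_a, C_b, rho):
    """Region (2.18): intersection of the two regions above."""
    return in_V1_bar(U_a, U_b, C_a, C_b) and in_V2_bar(U_a, U_b, C_a, C_b, rho)
```

For $\rho > C_b$ the second test alone decides membership, in line with the remark that the intersection then reduces to the second region.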


Note 1

Given

1) Scaled beta density functions $\beta^*(U_j \mid \gamma_{j1}, \gamma_{j2})$ on $U_j$, $j = 1,2$, according to (2.11) where $\gamma_{j1}$ and $\gamma_{j2}$ are positive integers.

2) Definition

$$Be(\gamma_{j1}, \gamma_{j2}) = \frac{\Gamma(\gamma_{j1})\,\Gamma(\gamma_{j2})}{\Gamma(\gamma_{j1} + \gamma_{j2})}$$

An upper bound for

$$\Pr[Q(i, d, d_0) > \alpha(i)]$$

is obtained by integrating the d.f. $\beta^*(U_a \mid \gamma_{a1}, \gamma_{a2})\,\beta^*(U_b \mid \gamma_{b1}, \gamma_{b2})$ over the region $\bar V(i)$ and hence over any region containing $\bar V(i)$ in the $(U_a, U_b)$ plane. By inspection of Figure 7 and by definition of $\beta^*$,

$$\Pr(\bar V(i)) \le T(i) = \int_{\rho}^{p_b/W} \beta^*(U_b \mid \gamma_{b1}, \gamma_{b2}) \int_{0}^{U_b - q} \beta^*(U_a \mid \gamma_{a1}, \gamma_{a2})\,dU_a\,dU_b$$

where

$$q = \begin{cases} \rho - C_a & \text{if } \rho > C_b \\ -C_a - C_b & \text{if } \rho < C_b \end{cases}$$

and $\beta^*(U_a \mid \cdot)$ is understood to vanish outside $0 < U_a < p_a/W$.

When the $\gamma$'s are integers, Appendix B carries out the integration with the result:

[Equation (2.19): closed-form finite-sum expression for $T(i)$ derived in Appendix B; illegible in the source scan.]

The computations for $T(i)$ are time consuming and subject to accumulated error. A simplifying approximation is to approximate the d.f.'s on the $U_j$'s with Gaussian d.f.'s having the same means and variances. This approximation is suggested by the fact that as its parameters get large while maintaining constant ratio, a beta d.f. converges pointwise to the Gaussian d.f. having the same mean and variance [36,37].
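Note 2

Given

(1) Gaussian density functions

$$g(U_j \mid \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left[-\frac{1}{2}\left(\frac{U_j - \mu_j}{\sigma_j}\right)^2\right], \qquad -\infty < U_j < \infty$$

on $U_j$, $j = 1,2$.

(2) The $U_b$ intercept $\xi_1$ and the slope $\xi_2$ of a straight line $U_b = \xi_1 + \xi_2 U_a$ supporting the region $\bar V(i)$ in the $(U_a, U_b)$ plane.

An upper bound $A(i)$ for

$$\Pr[Q(i, d, d_0) > \alpha(i)]$$

is obtained by integrating the d.f.

$$g(U_a \mid \mu_a, \sigma_a^2)\, g(U_b \mid \mu_b, \sigma_b^2)$$

over the half-plane supported by

$$U_b = \xi_1 + \xi_2 U_a$$

$$\Pr(\bar V(i)) \le A(i) = \iint_{U_b > \xi_1 + \xi_2 U_a} g(U_b \mid \mu_b, \sigma_b^2)\, g(U_a \mid \mu_a, \sigma_a^2)\,dU_a\,dU_b$$

Appendix C carries out the integration with the result:

$$A(i) = \Phi\!\left[\frac{-\xi_1 - \xi_2 \mu_a + \mu_b}{\big(\xi_2^2 \sigma_a^2 + \sigma_b^2\big)^{1/2}}\right] \tag{2.20}$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2} y^2}\,dy, \qquad -\infty < x < \infty$$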

Because of its simplicity, $A(i)$ of (2.20) is a more practical result than $T(i)$ of (2.19). In the remainder of this report $A(i)$ is used in place of $T(i)$, and the ith interval condition is satisfied (approximately) if

$$A(i) \le \tau(i) \tag{2.21}$$

The ith interval is said to be classified if (2.21) holds; the whole domain is said to be classified if (2.21) holds for each interval.

Because  A(i)  is  obtained  as  the  integral  over  a region  of  integration 
that  contains  the  one  used  for  T(i),  the  approximation  A(i)  for  T(i) 
tends  to  be  conservative.  Appendix  E contains  comparisons  of  A(i)  and 
T(i)  for  some  special  cases  in  which  the  regions  of  integration  are 
identical.  Good  agreement  is  observed. 
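As a small illustration of the Gaussian bound, the sketch below evaluates $A(i)$ under the assumption that (2.20) reads $A = \Phi\big[(-\xi_1 - \xi_2\mu_a + \mu_b)/(\xi_2^2\sigma_a^2 + \sigma_b^2)^{1/2}\big]$, which is the probability that $U_b - \xi_2 U_a$ exceeds $\xi_1$ for independent Gaussian $U_a$, $U_b$. The function name and argument order are illustrative choices.

```python
import math
from statistics import NormalDist


def gaussian_bound(xi1, xi2, mu_a, mu_b, var_a, var_b):
    """Probability mass of independent Gaussians above the line U_b = xi1 + xi2*U_a."""
    spread = math.sqrt(xi2 ** 2 * var_a + var_b)
    arg = (-xi1 - xi2 * mu_a + mu_b) / spread
    return NormalDist().cdf(arg)


def interval_is_classified(A, tau):
    """Condition (2.21): the interval is classified when A does not exceed tau."""
    return A <= tau
```

When the line passes through the mean point the bound is 0.5; moving the line away from the mean, toward larger $U_b$, lowers it.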

$A$ is a function of the supporting line $U_b = \xi_1 + \xi_2 U_a$. The problem of minimizing $A$ with respect to the parameters $\xi_1$ and $\xi_2$ is now considered. This minimization is subject to the constraint that the line $U_b = \xi_1 + \xi_2 U_a$ supports the region $\bar V(i)$. If the mean $(\mu_a, \mu_b)$ is in $\bar V(i)$, $A$ is given the value 1, and no minimization is attempted. In the following minimization, it is assumed that $(\mu_a, \mu_b)$ is not in $\bar V(i)$. Because $\Phi$ is monotonically increasing in its argument, minimization of $A$ is accomplished by minimizing the argument of $\Phi$.

Case 1

$$\rho > C_b$$

From Figure 7b, it is clear that only lines through the point $(U_a, U_b) = (C_a, \rho)$ need be considered. For such lines the $U_b$ intercept $\xi_1$ can be written in terms of the slope $\xi_2$ as

$$\xi_1 = \rho - C_a \xi_2$$

A straightforward minimization of the argument of $\Phi$ in (2.20) with respect to $\xi_2$ subject to the constraint that $\xi_2$ is in the range [0,1] leads to the value of $A(i)$ computed according to the flow diagram of Figure 9. The requirement $\xi_2$ in [0,1] ensures that the line supports $\bar V(i)$.
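The flow diagram of Figure 9 is not reproduced here. As a stand-in, the following sketch finds the Case 1 minimum by a plain grid search over the slope, assuming the intercept relation $\xi_1 = \rho - C_a\xi_2$ and the Gaussian bound of (2.20). The grid search and the function name are choices made for this example, not the report's procedure.

```python
import math
from statistics import NormalDist


def case1_bound(rho, C_a, mu_a, mu_b, var_a, var_b, steps=1000):
    """Smallest Gaussian bound over lines through (C_a, rho) with slope in [0, 1]."""
    # For rho > C_b the region is U_b > rho + max(0, U_a - C_a); if the mean
    # point lies inside it, the bound is set to 1 and no search is made.
    if mu_b > rho + max(0.0, mu_a - C_a):
        return 1.0, None
    phi = NormalDist().cdf
    best_A, best_slope = 2.0, None
    for k in range(steps + 1):
        xi2 = k / steps
        xi1 = rho - C_a * xi2
        arg = (-xi1 - xi2 * mu_a + mu_b) / math.sqrt(xi2 ** 2 * var_a + var_b)
        A = phi(arg)
        if A < best_A:
            best_A, best_slope = A, xi2
    return best_A, best_slope
```

A closed-form minimization would be faster; the grid version is used only because it makes the slope constraint explicit.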

Case 2

$$\rho < C_b$$

From Figure 7a, it is determined that $A(i)$ is minimized for a line through the point $(U_a, U_b) = \big(C_a + 2(\rho C_b)^{1/2},\, \rho\big)$ or for a line tangent to the quadratic portion of the boundary curve to $\bar V(i)$. Minimization of $A(i)$ with respect to lines through $\big(C_a + 2(\rho C_b)^{1/2},\, \rho\big)$ is accomplished similarly to the minimization for Case 1 except that $\xi_1$ is given in terms of $\xi_2$ by

$$\xi_1 = \rho - \big(C_a + 2(\rho C_b)^{1/2}\big)\,\xi_2$$

with $\xi_2$ constrained to the range $\big[0,\, (\rho/C_b)^{1/2}\big]$. Minimization of $A(i)$ with respect to tangent lines to the quadratic boundary of $\bar V(i)$ requires an iterative process described in Appendix D. The defining parameter is the $U_a$ coordinate $X_A$ for the point of tangency. Given the value $X_A$, one can obtain $\xi_1$ and $\xi_2$ from

$$\xi_2 = \frac{X_A - C_a}{2 C_b}$$

$$\xi_1 = -\frac{X_A^2 - C_a^2}{4 C_b}$$

The constraint that $X_A$ is in the range $\big[C_a + 2(\rho C_b)^{1/2},\, C_a + 2C_b\big]$ ensures that the tangent line supports $\bar V(i)$. The overall procedure leads to a value for $A(i)$ computed according to the flow diagram of Figure 10.

[Figure 9. Flow Diagram, $\rho > C_b$]
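The tangent-line parameters can be checked numerically. The sketch below assumes the quadratic part of the boundary is $U_b = (U_a - C_a)^2/(4C_b)$, which is what the condition $U_a - C_a = 2(C_bU_b)^{1/2}$ gives for $U_b < C_b$; the intercept formula is obtained from that curve by elementary calculus. Names are illustrative.

```python
import math


def quadratic_boundary(U_a, C_a, C_b):
    """Quadratic portion of the region boundary (valid where U_b < C_b)."""
    return (U_a - C_a) ** 2 / (4.0 * C_b)


def tangent_line(X_A, C_a, C_b):
    """Intercept xi1 and slope xi2 of the tangent at U_a = X_A."""
    xi2 = (X_A - C_a) / (2.0 * C_b)
    xi1 = (C_a ** 2 - X_A ** 2) / (4.0 * C_b)
    return xi1, xi2


def tangency_range(rho, C_a, C_b):
    """Admissible range of X_A for which the tangent supports the region."""
    return C_a + 2.0 * math.sqrt(rho * C_b), C_a + 2.0 * C_b
```

At the lower end of the range the slope equals $(\rho/C_b)^{1/2}$ and at the upper end it equals 1, so the tangent family joins the two straight-line families without a gap.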


2.6  Conditions  for  Domain  Classification 

It  is  of  interest  to  know  conditions  for  which  the  whole  domain 
can  be  classified. 

Proposition  3 

Let $W_T$ be the total width of the domain. Restrict $\alpha$ and $\beta$ by

$$0 < \alpha < 1$$
$$0 < \beta < 1$$

Allocate $\alpha$ and $1 - \beta$ to the intervals according to

$$\alpha(i) = \frac{W(i)}{W_T}\,\alpha$$

$$\tau(i) = \frac{W(i)}{W_T}\,(1 - \beta)$$

and define $\rho$ by

$$\rho = \frac{\alpha(i)}{W(i)} = \frac{\alpha}{W_T}.$$

Let an R-interval partition of the domain be given with each interval width $W(i)$ satisfying

$$0 < W(i) < W_{Max} = \frac{2\rho}{\max_{j=1,2}\,(p_j L_j)}$$

Let $n_j$ training observations from Class $j = 1,2$, be used to form a decision rule $d$ as discussed previously in this chapter.

Then the width restriction implies $\rho > C_b$ in each interval, and

$$\left(\frac{p_1^2}{n_1 + 2} + \frac{p_2^2}{n_2 + 2}\right)^{1/2} \le \min_i \frac{2W(i)\Big[\rho - \frac{1}{2} W(i) \max_{j=1,2}\,(p_j L_j)\Big]}{-\Phi^{-1}\Big[\frac{W(i)}{W_T}(1 - \beta)\Big]} \tag{2.22}$$

implies

$$A(i) \le \tau(i), \qquad i = 1,2,\ldots,R$$

the requirement for classification of the domain.

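Proof

Consider the ith interval. By hypothesis, the case $\rho > C_b$ of the previous analysis applies. The event $\bar V(i)$ for that case is illustrated by the cross-hatched region of Figure 11. Each point $(U_a, U_b)$ in the region defining the event $\bar V(i)$ satisfies

$$U_b > U_a + \rho - C_a \tag{2.23}$$

The line given by

$$U_b = U_a + \rho - C_a$$

supports the region $\bar V(i)$ and is one of those considered for the best such support in the computation of $A(i)$. If $A_1(i)$ is the integral of the approximating joint Gaussian d.f. over the half plane defined by (2.23), then

$$A(i) \le A_1(i)$$

From (2.20), using $\xi_1 = \rho - C_a$ and $\xi_2 = 1$,

$$A_1(i) = \Phi\!\left[\frac{-\rho + C_a - \mu_a + \mu_b}{(\sigma_a^2 + \sigma_b^2)^{1/2}}\right]$$

Thus, satisfaction of

$$\Phi\!\left[\frac{-\rho + C_a - \mu_a + \mu_b}{(\sigma_a^2 + \sigma_b^2)^{1/2}}\right] \le \tau(i)$$

assures classification of the ith interval. Equivalently, because $\Phi(x)$ is monotonically increasing in $x$, one can write

$$\frac{-\rho + C_a - \mu_a + \mu_b}{(\sigma_a^2 + \sigma_b^2)^{1/2}} \le \Phi^{-1}[\tau(i)] \tag{2.24}$$

where $\Phi^{-1}$ is the inverse of $\Phi$. Note that

$$\max_{j=1,2} C_j \ge C_a$$

and

$$\mu_a \ge \mu_b$$

Then

$$\frac{-\rho + \max_{j=1,2} C_j}{(\sigma_a^2 + \sigma_b^2)^{1/2}} \le \Phi^{-1}[\tau(i)]$$

implies (2.24). By hypothesis $\tau(i) < \frac{1}{2}$ so that $\Phi^{-1}[\tau(i)] < 0$. Rearranging gives ($\sigma_a$, $\sigma_b$ is a permutation of $\sigma_1$, $\sigma_2$)

$$(\sigma_1^2 + \sigma_2^2)^{1/2} \le \frac{\rho - \max_{j=1,2} C_j}{-\Phi^{-1}[\tau(i)]} \tag{2.25}$$

From (2.12)

$$\sigma_j^2 = \left(\frac{p_j}{W(i)}\right)^2 \frac{\gamma_{j1}\,\gamma_{j2}}{(\gamma_{j1} + \gamma_{j2})^2\,(\gamma_{j1} + \gamma_{j2} + 1)}$$

But the product of two numbers that sum to 1 is bounded by $\frac{1}{4}$. Thus

$$\sigma_j^2 \le \left(\frac{p_j}{2W(i)}\right)^2 \frac{1}{\gamma_{j1} + \gamma_{j2} + 1} = \left(\frac{p_j}{2W(i)}\right)^2 \frac{1}{n_j + 2}$$

With this bound on $\sigma_j^2$ the inequality

$$\frac{1}{2W(i)}\left(\frac{p_1^2}{n_1 + 2} + \frac{p_2^2}{n_2 + 2}\right)^{1/2} \le \frac{\rho - \max_{j=1,2} C_j}{-\Phi^{-1}[\tau(i)]}$$

implies (2.25). Appropriate substitutions give

$$\left(\frac{p_1^2}{n_1 + 2} + \frac{p_2^2}{n_2 + 2}\right)^{1/2} \le \frac{2W(i)\Big[\rho - \frac{1}{2} W(i) \max_{j=1,2}\,(p_j L_j)\Big]}{-\Phi^{-1}\Big[\frac{W(i)}{W_T}(1 - \beta)\Big]} \tag{2.26}$$

Thus (2.22) implies that

$$A(i) \le \tau(i), \qquad i = 1,\ldots,R$$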


It  is  interesting  that  one  can  specify  - before  taking  any 
training  observations  - a satisfactory  partition  and  the  number  of 
training  observations  that  assure  classification  of  the  whole  domain. 
Consider,  for  example,  the  special  case  in  which: 
$$p_1 = p_2 = \tfrac{1}{2}$$
$$L_1 = L_2 = L$$
$$W_T = 1$$
$$\alpha = 0.1$$
$$\beta = 0.9$$
$$W(i) = W, \qquad i = 1,\ldots,R$$
$$n_1 = n_2$$

Then (2.22) simplifies to

$$n_j \ge \frac{1}{2}\left[\frac{\Phi^{-1}(0.1\,W)}{2W\big(0.1 - \frac{1}{4} L W\big)}\right]^2 - 2, \qquad j = 1,2$$

where $W$ must satisfy $0 < W < 0.4/L$. The smallest $n_j$ that satisfies this inequality is plotted in Figure 12 as a function of $W$ for each of several $L$ values.
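The special case is easy to evaluate numerically. The sketch below assumes the simplified form of (2.22) for $p_1 = p_2 = \frac12$, $W_T = 1$ and equal widths, namely $(2(n+2))^{-1/2} \le 2W(\alpha - LW/4)/(-\Phi^{-1}[(1-\beta)W])$; the function name and default arguments are illustrative, and no values from Figure 12 are reproduced.

```python
import math
from statistics import NormalDist


def min_training_per_class(W, L, alpha=0.1, beta=0.9):
    """Smallest n_1 = n_2 satisfying the special-case form of (2.22)."""
    W_max = 4.0 * alpha / L          # 2*rho / (p*L) with p = 1/2 and rho = alpha
    if not 0.0 < W < W_max:
        raise ValueError("interval width must lie in (0, 4*alpha/L)")
    z = NormalDist().inv_cdf((1.0 - beta) * W)   # negative because tau < 1/2
    margin = 2.0 * W * (alpha - W * L / 4.0)
    n = 0.5 * (z / margin) ** 2 - 2.0
    return max(0, math.ceil(n))
```

For example, `min_training_per_class(0.1, 1.0)` returns a value on the order of ten thousand, which agrees with the remark that the numbers required by the proposition are generally very large.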

Several observations concerning Proposition 3 can be made.

1) The numbers $n_1$ and $n_2$ required for satisfaction of (2.22) are generally very large. This is to be expected because the proposition states a result that does not use the values $\gamma_{j1}$, $\gamma_{j2}$. Regardless of these values the result is applicable. Suppose that training observations and hence $\gamma_{j1}$, $\gamma_{j2}$, are available for the ith interval; hence $\mu_a$ and $\mu_b$ can be determined for the interval. Suppose further that $\mu_a$ and $\mu_b$ remain constant as $n_1$, $n_2$, and $W$
are  varied.  If  P^  = P2  = = 5,  a = 0.1,  R = 0.9,  and 

$n_1 = n_2$, then the minimum $n_j$ required to classify the interval is plotted in Figure 13 for several $\mu_a$, $\mu_b$ values. Note that $n_j$ is much smaller when the parameters $\gamma_{j1}$, $\gamma_{j2}$ can be used. The next chapter assumes $\mu_a$ and $\mu_b$ are constant over $n_1$, $n_2$, and $W$, so that estimates of the number of training observations required to classify the interval can be obtained. Adjustment in interval width is made based on these estimates.

2) Maximization of the right side of (2.22) with respect to the interval widths allows widths to be chosen that correspond to the smallest values $n_1$ and $n_2$ that satisfy (2.22). Such "best" interval widths correspond to a "best" number of intervals. Hughes [26], using a mean recognition accuracy criterion, also arrives at a "best" number of intervals.

3) By requiring the interval widths to be less than or equal to $W_{Max}$, a minimum is placed on the number of intervals. It is possible that this minimum conflicts with the assumed storage constraint.

Also  of  interest  is  the  rate  at  which  the  quantity 


- " + c.  ~ V ^ 
«■£ ♦ $ 


changes  with  n,  where 


54 


Assuming  that  are  constant  with  n,  that 


n = P n » 2 
a a 

% = Pb  n **  2 

and  that 

I 

i 

P 

-4  ^ 

w a 

a. 

W ^ 

the  rate  of  change  in  • (Q)  is  given  by 

9 *(a)  i c i " 
B*te  “ ~ = i 7%  7 T 


where  C is  a constant  given  by 
This chapter has discussed the classification procedure for a fixed R-interval partition of the domain. In Chapter III an ad hoc approach for adjusting the partition is described. The objective is to classify the whole domain with as few training observations as possible.

CHAPTER  III
ALTERING  THE  PARTITION

3.1  Introduction

The  classification  procedure  of  Chapter  II  operates  with 
a given  partition.  Generally,  the  partition  can  be  adjusted 
to  decrease  the  number  of  training  observations  required  for 
classification  of  the  whole  domain.  A desirable  adjustment 
procedure  would  be  one  that  minimized  this  number. 

A hill  climbing  technique  could  be  used  to  minimize  an 
estimate  of  the  number  of  training  observations  required  for 
domain  classification.  Similarly,  hill  climbing  techniques 
could  be  used  to  maximize  an  estimate  of  the  divergence  [52], 
an  estimate  of  the  information  contained  in  an  observation 
about  its  unknown  class  [39],  or  any  other  global  measure  of 
the  separation  of  density  functions.  The  hill  climbing  technique 
using  the  first  criterion  mentioned  generally  requires  many 
intervals,  while  using  the  other  criteria,  it  has  not  been  shown 
to  achieve  satisfaction  of  condition  1.6. 

This  chapter  describes  an  ad  hoc  partition  adjustment 
procedure  that  operates  by  sequentially  adjusting  the  widths 
of  unclassified  intervals  in  the  order  that  they  appear  in  a 
table called the priority table. The width of an interval under consideration is adjusted to increase an estimate $\hat r$ of the interval's "classification rate":

$$\hat r = \frac{\hat p}{\hat n}$$

where $\hat p$ is an estimate of the mixture probability in the interval ($p = p_1 P_1 + p_2 P_2$) and $\hat n$ is an estimate of the number of training observations required to classify the interval*. The estimated
rate  r is  a reasonable  performance  measure  in  that  it  increases 
with  p and  decreases  with  n.  A possible  disadvantage  is  that 
it  is  local  (applies  to  one  interval)  as  opposed  to  being  global 
(applies  to  all  intervals);  i.e.  partition  adjustment  using  a 
global  measure  may  result  in  a smaller  estimated  number  of 
training  observations  required  for  classification.  Application 
of  a global  technique  would  need  to  constrain  the  partition 
so  that  it  allows  classification  of  the  intervals. 

With suitable approximations (to be listed) $\hat p$, $\hat n$, and thus $\hat r$ can be written as functions of the interval width $W'$ (in this chapter notation with a prime refers to variable quantities, whereas unprimed notation refers to observed quantities). $\hat r(W')$ can be maximized with respect to $W'$; the resulting maximum is denoted $\hat r_M$, and the interval width giving this maximum is denoted $\hat W_M$. The intervals are listed in the priority table in order of decreasing $\hat r_M$ values.

*The  notation  omits  reference  to  a particular  interval. 
The following items constrain the adjustment procedure:

1) A change in the size of an interval influences the sizes and/or number of other intervals. A change of an interval's size is not allowed if it affects an interval preceding it in the priority table.

2)  Rules  are  stated  that  determine  if  an  interval  adjustment 
is  made.  An  interval  is  adjusted  either  by  splitting 

it  into  two  intervals  or  by  combining  it  with  an 
adjacent  interval.  These  types  of  adjustments  allow 
for  a reasonable  amount  of  change  at  each  adjustment 
stage  and  for  larger  changes  over  several  adjustment 
stages. 

3)  No  more  than  R intervals  are  allowed  in  the  partition 
at  any  one  time. 

4) After a partition adjustment, the beta distributions* on the P's are reinitialized.

3.2  Rate Estimates

The estimate $\hat r(W')$ is obtained by first obtaining $\hat p(W')$ and $\hat n(W')$. The following simplifying approximations are useful. The quality of these approximations affects only the partition adjustment procedure and not classification based on a given partition.

*Note that even though Gaussian approximations are used for computations, all updating and reinitialization is done with beta distributions.
Approximations


«f  vji : Vir’Q) 

Then 


^4  Vj*|  ^4  Vn  /^jv 

J VT  (nj  + 1)  W (nj  + l)  Vnj/ 


- ( y,)l  \ 

= w • (jrpn;  - ^ 


and 


»*K\T  ~ ^ )~  ^ ~ 

ifn'  + 2)  n.  + 2 


2>"  "j 5 -j  •' 

-»ttt  / N - 

3)  (vT'^y^vT 


Then

$$\sigma_j^2 \approx \frac{p_j\,\mu_j}{W n_j} = \frac{\mu_j}{W n}$$

*$\gamma_{j1}$ is the number of training observations in the interval from the jth class. It is assumed that $n_j + 1 \approx n_j$.

**The numbers of training observations from the 2 classes are assumed proportional to the a priori class probabilities.

†The number of training observations in the interval is assumed small compared with the total number from the class. Also subsequently assumed is $n_j + 2 \approx n_j$.
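With these approximations, $\hat p(W')$ is easily computed as

$$\hat p(W') = W'(\mu_a + \mu_b) \tag{3.1}$$

Let $\alpha$ and $(1 - \beta)$ be allocated to the intervals according to

$$\alpha(i) = \frac{W'(i)}{W_T}\,\alpha$$

$$\tau(i) = \frac{W'(i)}{W_T}\,(1 - \beta) \tag{3.2}$$

Note that

$$\rho = \frac{\alpha(i)}{W'(i)} = \frac{\alpha}{W_T}$$

is constant with $W'(i)$. Computation of $\hat n(W'(i))$ proceeds for the ith interval by using the above approximations in the ith interval condition

$$\Phi\!\left[\frac{-\xi_1 - \xi_2 \mu_a + \mu_b}{(\xi_2^2 \sigma_a^2 + \sigma_b^2)^{1/2}}\right] \le \tau \tag{3.3}$$

for suitable values of $\xi_1$ and $\xi_2$ (i has been dropped from the notation).

Taking the inverse gives

$$\frac{-\xi_1 - \xi_2 \mu_a + \mu_b}{(\xi_2^2 \sigma_a^2 + \sigma_b^2)^{1/2}} \le \Phi^{-1}(\tau) \tag{3.4}$$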
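Using the variable quantities and the above approximations leads to

$$\frac{-\xi_1 - \mu_a \xi_2 + \mu_b}{\left[\dfrac{\mu_a \xi_2^2 + \mu_b}{W' n'}\right]^{1/2}} \le \Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)$$

or assuming that

$$\frac{W'}{W_T}(1 - \beta) < \tfrac{1}{2} \qquad *$$

and

$$\xi_1 + \mu_a \xi_2 - \mu_b > 0 \qquad †$$

the inequality becomes

$$n' \ge \left(\frac{\mu_a \xi_2^2 + \mu_b}{W'}\right) \left[\frac{\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)}{\xi_1 + \mu_a \xi_2 - \mu_b}\right]^2$$

The estimate $\hat n(W')$ is taken as

$$\hat n(W', \xi_1, \xi_2) = \left(\frac{\mu_a \xi_2^2 + \mu_b}{W'}\right) \left[\frac{\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)}{\xi_1 + \mu_a \xi_2 - \mu_b}\right]^2 \tag{3.5}$$

where $\xi_1$, $\xi_2$ have been included in the notation to indicate dependence on these quantities. From (3.1) and (3.5)

$$\hat r(W', \xi_1, \xi_2) = \frac{W'^2\,(\mu_a + \mu_b)}{\mu_a \xi_2^2 + \mu_b} \left[\frac{\xi_1 + \mu_a \xi_2 - \mu_b}{\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)}\right]^2, \qquad \frac{W'}{W_T}(1 - \beta) < \tfrac{1}{2}, \quad \xi_1 + \mu_a \xi_2 - \mu_b > 0 \tag{3.6}$$

To obtain $\hat r_M$, the quantity $\hat r(W', \xi_1, \xi_2)$ should be maximized over all values $\xi_1$, $\xi_2$ for which the line $U_b = \xi_1 + \xi_2 U_a$ supports the region $\bar V(i)$ and over all interval widths $W'$. For simplification $\hat r(W', \xi_1, \xi_2)$ is maximized over $W'$ for each of three sets of $(\xi_1, \xi_2)$ values, and then the maximum of these is chosen for $\hat r_M$. It is now assumed that

$$\frac{W'}{W_T}(1 - \beta) \ll \tfrac{1}{2}$$

Then for maximization of $\hat r(W', \xi_1, \xi_2)$ with respect to $W'$, the quantity $\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)$ is considered to be approximately constant.

*This is a constraint on $W'$.

†Assumes $(\mu_a, \mu_b)$ is below the supporting line for $\bar V(i)$ in the $(U_a, U_b)$ plane.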


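Case 1  $(\xi_1, \xi_2) = (\rho, 0)$, $\rho > \mu_b$

$$\hat r(W', \rho, 0) = \left(\frac{\mu_a + \mu_b}{\mu_b}\right) \left[\frac{W'(\rho - \mu_b)}{\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)}\right]^2$$

Thus $\hat r(W', \rho, 0)$ is maximum for $W'$ as large as possible. To avoid problems with poor approximation accuracy for large $W'$, the value $\hat W_M(\rho, 0)$ is defined as the current width*

$$\hat W_M(\rho, 0) = W$$

and $\hat r_M(\rho, 0)$ is defined by

$$\hat r_M(\rho, 0) = \left(\frac{\mu_a + \mu_b}{\mu_b}\right) \left[\frac{W(\rho - \mu_b)}{\Phi^{-1}\!\left(\frac{W}{W_T}(1 - \beta)\right)}\right]^2$$

The rules presented shortly for adjusting intervals encourage combining an interval with another when $\hat W_M > W$ which is true in this case.

*This is ad hoc and another value for $\hat W_M(\rho, 0)$ could be used.

Case 2  $(\xi_1, \xi_2) = (\rho - C_a, 1)$, $\rho > C_b$

or

$$(\xi_1, \xi_2) = \big(\rho - \tfrac{1}{2} p_a L_a W',\ 1\big), \qquad W' < \frac{2\rho}{p_b L_b}$$

$$\hat r(W', \rho - C_a, 1) = \left\{\frac{\dfrac{p_a L_a}{2}\, W'\left[W' - \dfrac{2(\rho + \mu_a - \mu_b)}{p_a L_a}\right]}{\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)}\right\}^2$$

$$\hat W_M(\rho - C_a, 1) = \begin{cases} \dfrac{\rho + \mu_a - \mu_b}{p_a L_a} & \text{if } \dfrac{\rho + \mu_a - \mu_b}{p_a L_a} < \dfrac{2\rho}{p_b L_b} \\[2ex] \dfrac{2\rho}{p_b L_b} & \text{if } \dfrac{2\rho}{p_b L_b} \le \dfrac{\rho + \mu_a - \mu_b}{p_a L_a} \end{cases}$$

$\hat r_M(\rho - C_a, 1)$ is obtained by substituting $\hat W_M(\rho - C_a, 1)$ for $W'$ in $\hat r(W', \rho - C_a, 1)$.

Case 3  $(\xi_1, \xi_2) = (-C_a - C_b, 1)$, $\rho < C_b$

or

$$(\xi_1, \xi_2) = \big(-\tfrac{1}{2}(p_a L_a + p_b L_b) W',\ 1\big), \qquad W' > \frac{2\rho}{p_b L_b}$$

$$\hat r(W', -C_a - C_b, 1) = \left\{\frac{\dfrac{p_a L_a + p_b L_b}{2}\, W'\left[W' - \dfrac{2(\mu_a - \mu_b)}{p_a L_a + p_b L_b}\right]}{\Phi^{-1}\!\left(\frac{W'}{W_T}(1 - \beta)\right)}\right\}^2$$

$$\hat W_M(-C_a - C_b, 1) = \frac{\mu_a - \mu_b}{p_a L_a + p_b L_b} \qquad \text{if } \frac{2\rho}{p_b L_b} < \frac{\mu_a - \mu_b}{p_a L_a + p_b L_b}$$

$\hat r_M(-C_a - C_b, 1)$ is obtained by substituting $\hat W_M(-C_a - C_b, 1)$ for $W'$ in $\hat r(W', -C_a - C_b, 1)$. The rate for Case 2 is at least as large as that for Case 3 when $\dfrac{2\rho}{p_b L_b} \ge \dfrac{\mu_a - \mu_b}{p_a L_a + p_b L_b}$, thus obviating the need to consider Case 3 in that event. $\hat r_M$ is set to the maximum of $\hat r_M(\rho, 0)$, $\hat r_M(\rho - C_a, 1)$, and $\hat r_M(-C_a - C_b, 1)$ and $\hat W_M$ is the corresponding interval width. Similar computations for each of the intervals and a subsequent ranking according to $\hat r_M$ values gives the priority table.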
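The rate estimate and the Case 2 width can be sketched as follows. This assumes that the rate is $\hat r = \hat p/\hat n$ with $\hat p = W'(\mu_a + \mu_b)$ from (3.1) and with $\hat n$ given by the inequality for $n'$ that follows from (3.4) under the approximation $\sigma_j^2 \approx \mu_j/(W'n')$. The Case 2 width is the maximizer of $W'(\rho + \mu_a - \mu_b - \frac12 p_aL_aW')$, capped so that $\rho > C_b$. The names `pL_a` and `pL_b` stand for the products $p_aL_a$ and $p_bL_b$ and are illustrative.

```python
from statistics import NormalDist


def rate_estimate(W_prime, xi1, xi2, mu_a, mu_b, W_T, beta):
    """Estimated classification rate p_hat / n_hat for a trial width W_prime."""
    tau = W_prime / W_T * (1.0 - beta)
    margin = xi1 + mu_a * xi2 - mu_b
    # Both side conditions of the rate formula must hold; otherwise report 0.
    if not 0.0 < tau < 0.5 or margin <= 0.0:
        return 0.0
    z = NormalDist().inv_cdf(tau)
    p_hat = W_prime * (mu_a + mu_b)
    n_hat = (mu_a * xi2 ** 2 + mu_b) / W_prime * (z / margin) ** 2
    return p_hat / n_hat


def width_case2(rho, mu_a, mu_b, pL_a, pL_b):
    """Width maximizing the Case 2 rate with the inverse-normal term held constant."""
    return min((rho + mu_a - mu_b) / pL_a, 2.0 * rho / pL_b)
```

Holding $\Phi^{-1}$ constant while optimizing the width mirrors the approximation stated in the text; the returned width is therefore only approximately optimal for `rate_estimate` itself.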
3.3  Interval Operations

The unclassified intervals are considered sequentially in the order that they appear in the priority table. Suppose that attention is centered on the ith interval of the priority table. Operations assumed available to change its size are:

1)  Do  not  alter  the  interval. 

2)  Split  the  interval  into  two  equal  intervals. 

3)  Combine  the  interval  with  an  adjacent  interval. 

If $\hat W_M > W$, an attempt is made to combine the interval with an adjacent interval; if $W \ge \nu \hat W_M$, an attempt is made to split the interval into two equal intervals; otherwise, no interval change is attempted. The constant $\nu$ was arbitrarily chosen as 1.6 (values of $\nu$ closer to one can cause too frequent interval changing). Without a constant $\nu > 1$, an interval might never be classified because of alternate splitting and combining from one set of training observations to the next.
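The choice among the three operations reduces to two comparisons. A minimal sketch, assuming the rule is "combine if $\hat W_M > W$, split if $W \ge \nu\hat W_M$, otherwise leave unchanged" with $\nu = 1.6$; the function name and the string labels are illustrative.

```python
def choose_operation(W, W_M, nu=1.6):
    """Pick the attempted operation from the current width W and the preferred width W_M."""
    if W_M > W:
        return "combine"
    if W >= nu * W_M:
        return "split"
    return "none"
```

The band $W_M \le W < \nu W_M$ in which nothing is attempted is what prevents an interval from alternating between splitting and combining.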

Combining 

The  following  is  a list  of  conditions  that  must  be  satisfied 
before  the  ith  interval  is  combined  with  an  adjacent  interval. 

a) The adjacent interval is unclassified and appears in a lower position of the priority table than the ith interval.

b) $\hat W_M > W$ for the adjacent interval as well as for the ith interval.

c) The adjacent interval is tentatively classified to the same class as the ith interval; that is, "a" and "b" in $\mu_a$ and $\mu_b$ for the adjacent interval are the same as those for the ith interval.

d) The estimated "classification rate" for the combined interval is greater than the sum of the rates for the component intervals considered separately. Before the combined interval rate can be obtained, parameters characterizing the interval are computed by a procedure discussed in the next section.

e) If both intervals adjacent to the ith interval satisfy these conditions, then combining is performed with the adjacent interval giving the largest improvement in classification rate.

If combining takes place, then $\hat W_M$ for the combined interval is taken as the average of the values for its component intervals. The combined interval takes the place of the ith interval in the priority table and is processed again in exactly the same fashion. The adjacent interval that is combined is removed from the priority table.

Splitting 

In order for the ith interval to be split, the total number of intervals must be less than R. If not, then a search is made for an adjacent pair of intervals that can first be combined.

The search is made in the following order.

a) A pair of adjacent classified intervals, classified to the same class, is sought and if found they are combined together. Otherwise,

b) A pair of unclassified adjacent intervals appearing in positions lower than i in the priority table is sought and if found they are combined together.

The ith interval is split if the total number of intervals is less, or can be made less by Items a) and b), than R. In that case the ith interval is split and splitting is said to be successful; otherwise, it is unsuccessful. If splitting is not successful, then no additional interval changes are possible at that stage, and the partition adjustment phase is terminated.

Any time an interval is adjusted, the parameters characterizing it are computed from the technique discussed in the next section. In addition, tables containing the characterizing parameters are adjusted. At the conclusion of adjusting the ith interval in the priority table, the (i + 1)st interval is considered unless the end of the priority table has been reached or the ith interval cannot be split; in either case the partition adjustment process is terminated. Figure 14 summarizes the partition adjustment procedure.


Before the partition adjustment, a beta d.f. on each of the P's is known. After the partition adjustment, d.f.'s for those P's in unaltered intervals are the same as before the adjustment. For an altered interval, however, a record of the number of training observations in the interval from each class is unavailable. This section derives characterizing parameters for beta d.f.'s on the P's after a partition adjustment, in terms of the expected values and variances of the P's in contributing intervals before the adjustment.

Interval Combining

Suppose that the ith and (i + 1)st intervals are combined. Consider just the jth class and let i and (i + 1) denote the ith and (i + 1)st intervals respectively with no interval notation indicating the combined interval. Thus,

$$P = P(i) + P(i + 1)$$

$$EP = EP(i) + EP(i + 1)$$

An upper bound on the variance of P is

$$\operatorname{Var} P = \operatorname{Var} P(i) + \operatorname{Var} P(i + 1) + 2\big(\operatorname{Var} P(i)\,\operatorname{Var} P(i + 1)\big)^{1/2}$$

because for any random variables X and Y

$$\operatorname{Var}(X + Y) = E(X + Y)^2 - E^2(X + Y)$$

$$= \operatorname{Var} X + \operatorname{Var} Y + 2\left[\frac{\operatorname{Cov}(X, Y)}{(\operatorname{Var} X\,\operatorname{Var} Y)^{1/2}}\right](\operatorname{Var} X\,\operatorname{Var} Y)^{1/2}$$

$$\le \operatorname{Var} X + \operatorname{Var} Y + 2(\operatorname{Var} X\,\operatorname{Var} Y)^{1/2}$$

The characterizing parameters of the beta d.f. on P for the combined interval are obtained from Equations (A.4) in Appendix A.
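The combining step can be sketched in a few lines. The mean and the variance bound follow the expressions above. Equations (A.4) are not shown in this section, so the conversion from moments to beta parameters below uses the standard method-of-moments relations as an assumed stand-in; the function names are illustrative.

```python
import math


def combine_moments(mean_1, var_1, mean_2, var_2):
    """Mean and worst-case (perfectly correlated) variance of P(i) + P(i+1)."""
    mean = mean_1 + mean_2
    var = var_1 + var_2 + 2.0 * math.sqrt(var_1 * var_2)
    return mean, var


def beta_from_moments(mean, var):
    """Beta parameters matching a given mean and variance (method of moments)."""
    k = mean * (1.0 - mean) / var - 1.0
    if k <= 0.0:
        raise ValueError("variance too large for a beta d.f. with this mean")
    return mean * k, (1.0 - mean) * k
```

The variance bound equals the square of the sum of the two standard deviations, so the combined interval is never treated as better known than its components would allow.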
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Interval  Splitting 

Again  only  the  class  probabilities  are  considered  with 
no  notational  mention  of  it.  Suppose  that  an  interval  is  split 
into  the  i*'*1  and  (i  + l)8^  intervals.  P is  the  random  variable 
for  the  interval  probability  before  splitting;  P(i)  and  P(i  + 1) 
are  random  variables  for  the  i*^  and  (i  + l)3*’  interval  prob- 
abilities after  splitting.  Because  no  information  is  available 
about  the  variation  across  the  interval  before  splitting,  the 
distributions  on  P(i)  and  P(i  + 1)  after  splitting  should  be 
identical  to  each  other;  thus, 

EP(i)  = y (3.7) 

The allocation of the probability P is not necessarily uniform across the interval. The worst case for this allocation is governed by the Lipschitz constant L for the class-conditional d.f. Figure 15 illustrates a worst case allotment of P to the interval. The worst case occurs when the actual d.f. f for the class has its maximum absolute slope L over the interval as shown in the figure. Then P(i) is given by

$$P(i) = \frac{P}{2} + \epsilon \tag{3.8}$$

where ε for such a worst case is given by

$$\epsilon = \frac{L}{2}\left(\frac{W}{2}\right)^2$$

with W the width of the interval before splitting. In general ε can vary over the range

$$|\epsilon| \le \frac{L}{2}\left(\frac{W}{2}\right)^2$$

A distribution governing the probability of occurrence of ε through the range defined is not known, but it can be assumed symmetric about 0, and independent of the distribution on P. Because of the symmetry about 0, the expected value of P(i) given by (3.8) is consistent with (3.7). Because of the assumed statistical independence of ε and P the variance of P(i) is

$$\operatorname{Var} P(i) = \frac{1}{4}\operatorname{Var} P + \operatorname{Var} \epsilon$$

A worst case is when 1/2 of the distribution of ε is concentrated at each of ±(L/2)(W/2)^2. Then

$$\operatorname{Var} \epsilon = \left(\frac{L}{2}\left(\frac{W}{2}\right)^2\right)^2$$

and for a worst case (highest variance)

$$\operatorname{Var} P(i) = \frac{1}{4}\operatorname{Var} P + \frac{1}{4}\left(L\left(\frac{W}{2}\right)^2\right)^2 \tag{3.9}$$

The characterizing parameters of the beta d.f. on P(i) are obtained by using EP(i) and Var P(i) of (3.7) and (3.9) in Equations (A.4) of Appendix A.
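
A minimal sketch of the splitting rule follows. It assumes that the worst-case offset is the probability shifted between the two halves when the density has constant slope L across an interval of width `width`, which gives (L/2)(width/2)^2; the function and argument names are illustrative and do not come from the original program.

```python
def split_moments(mean_p, var_p, lipschitz, width):
    """Worst-case moments of P(i) after splitting an interval in two.

    mean_p, var_p : mean and variance of P before splitting
    lipschitz     : assumed Lipschitz constant L of the class-conditional d.f.
    width         : width of the interval before splitting (assumed name)
    """
    # Largest probability that a density of slope L can shift between halves.
    eps_max = 0.5 * lipschitz * (width / 2.0) ** 2
    mean_half = 0.5 * mean_p                 # as in (3.7)
    var_half = 0.25 * var_p + eps_max ** 2   # as in (3.9)
    return mean_half, var_half, eps_max
```

For example, with L = 25 and an interval of width 0.2 the offset bound is 0.125, so the second term dominates the variance of P(i) unless the interval is narrow. This is one way to see why a smaller assumed Lipschitz constant reduces the number of training observations.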


CHAPTER IV

COMPUTER SIMULATED RESULTS

4.1 Introduction

This chapter contains results obtained by using an IBM 1130 computer system to generate and to process simulated data. The simulated data is generated using standard pseudo-random number generation techniques (see e.g. [40]). Processing follows the flow diagram of Figure 3 in Chapter I. An interval of a given R-interval partition of Ω is "classified" if condition (2.21) of Chapter II is satisfied. The intervals of an initial R-interval partition are defined using the first R - 1 training observations by the technique described in Appendix A. The partition is subsequently adjusted using the procedure of Chapter III. The ith interval, if not "classified" by satisfaction of (2.21), can be "tentatively classified" to one of the classes by using (2.12). Thus, even if all intervals are not classified, tentative results are available until they are classified.

4.2 Allocation of α and 1 - β to the Intervals

Experimentally, it was found that assignment of α(i) and τ(i) according to

$$\alpha(i) = \frac{W(i)}{W}\,\alpha$$

$$\tau(i) = \frac{W(i)}{W}\,(1 - \beta) \tag{4.1}$$

is not economical in terms of the number n of training observations required for domain classification. Here W(i) is the length of the ith interval and W is the length of the domain. The assignment (4.1) is equivalent to

$$\alpha(i) = \frac{W(i)}{W_T}\,\alpha^{*}$$

$$\tau(i) = \frac{W(i)}{W_T}\,(1 - \beta)^{*} \tag{4.2}$$

where α* and (1 - β)* are the portions of α and (1 - β) that have not been used for the classified intervals, and W_T is the cumulative length of the unclassified intervals. A significant reduction in n was experimentally observed with modification of (4.2),

$$\alpha(i) = u_1(p_U)\,\frac{W(i)}{W_T}\,\alpha^{*}$$

$$\tau(i) = u_2\,\frac{W(i)}{W_T}\,(1 - \beta)^{*} \tag{4.3}$$

where u_1(p_U) depends on an estimate p_U of the probability* in all unclassified intervals by the relation

$$u_1(p_U) = t_1 + (t_2 - t_1)(1 - p_U)$$

and u_2 is constant. u_2, t_1, and t_2 are experimentally chosen constants; the experimental examples subsequently described use

$$u_2 = 0.5$$

$$t_1 = 0.02$$

$$t_2 = 0.05/\alpha$$

The modification allots larger portions of α and (1 - β) to the last regions classified, causing them to be classified with less difficulty. The observed decrease in n is attributed to this fact.

*p_U is the sum of estimates of mixture probabilities in unclassified intervals.
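
The modified allocation (4.3) can be sketched as below. The sketch rests on two assumptions that should be checked against the original equations: that each interval receives a share proportional to its length relative to the cumulative unclassified length, and that the weighting has the linear form u_1 = t_1 + (t_2 - t_1)(1 - p_U). All names are illustrative.

```python
def allocate(width_i, width_unclassified, alpha_rest, tau_rest, p_u,
             t1=0.02, t2=0.5, u2=0.5):
    """Portions of alpha and (1 - beta) allotted to one unclassified interval.

    width_i            : length of the interval being considered
    width_unclassified : cumulative length of all unclassified intervals
    alpha_rest         : portion of alpha not yet used by classified intervals
    tau_rest           : portion of (1 - beta) not yet used
    p_u                : estimated mixture probability in unclassified intervals
    t1, t2, u2         : experimentally chosen constants (t2 = 0.05 / alpha)
    """
    # Assumed linear weighting: small while most probability is unclassified,
    # growing toward t2 as the unclassified probability shrinks.
    u1 = t1 + (t2 - t1) * (1.0 - p_u)
    share = width_i / width_unclassified
    return u1 * share * alpha_rest, u2 * share * tau_rest
```

With the defaults above (t2 = 0.05/α for α = 0.1), an interval of length 0.1 among unclassified intervals of total length 0.5 receives 0.0004 of α when p_U = 1 and 0.01 of α when p_U = 0, which illustrates how the last regions classified receive the larger portions.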


4.3 Study of a Particular Problem

Let f_1 and f_2 be truncated Gaussian d.f.'s given by

$$f_1(x) = \begin{cases} K_1 \exp\left[-\frac{1}{2}\left(\frac{x - 0.4}{0.1}\right)^2\right], & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

$$f_2(x) = \begin{cases} K_2 \exp\left[-\frac{1}{2}\left(\frac{x - 0.6}{0.1}\right)^2\right], & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases} \tag{4.4}$$

where K_1 and K_2 are normalization constants included so that f_1 and f_2 integrate to 1, and let the problem parameters be

$$\begin{aligned}
\alpha &= 0.1 \\
\beta &= 0.9 \\
P_j &= 0.5, \quad j = 1,2 \\
L_j &= 25, \quad j = 1,2 \\
R &= 9
\end{aligned} \tag{4.5}$$


Before presenting results for R = 9, a graphical illustration of the partition changes for R = 5 is presented in Figure 16 for one experiment. In Figure 16, a single horizontal line indicates the corresponding interval is tentatively classified; a double horizontal line indicates the interval is classified. The lines being above or below the axis indicate Class ω_1 or Class ω_2 respectively. The result after 554 training observations is a classified domain with decision threshold at 0.50197 and error probability of 0.15870. This compares with an optimum threshold at 0.50000 and error probability of 0.15866.
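
The quoted error probabilities can be checked with a short calculation. The sketch below takes the class means as 0.4 and 0.6 with standard deviation 0.1 and equal a priori probabilities; the mean 0.4 is an assumption, chosen because it is consistent with the stated optimum threshold of 0.50000. Function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_mass(lo, hi, mean, sigma):
    """Probability of [lo, hi] under a Gaussian truncated to (0, 1)."""
    lo, hi = max(lo, 0.0), min(hi, 1.0)
    total = normal_cdf((1.0 - mean) / sigma) - normal_cdf((0.0 - mean) / sigma)
    part = normal_cdf((hi - mean) / sigma) - normal_cdf((lo - mean) / sigma)
    return part / total


def error_probability(threshold, m1=0.4, m2=0.6, sigma=0.1, p1=0.5, p2=0.5):
    """Error of the rule: decide class 1 below the threshold, class 2 above."""
    miss_1 = truncated_mass(threshold, 1.0, m1, sigma)  # class 1 sent to class 2
    miss_2 = truncated_mass(0.0, threshold, m2, sigma)  # class 2 sent to class 1
    return p1 * miss_1 + p2 * miss_2
```

Evaluating `error_probability(0.5)` and `error_probability(0.50197)` reproduces the two figures quoted above to within rounding, and shows how flat the error is near the optimum threshold.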

For a comparison with well known parametric techniques (see e.g. [5]), assume it is known that f_1 and f_2 are Gaussian with standard deviation 0.1. Then the only unknown parameters are the means. It is easily shown that only four observations (two from each class) need be taken to satisfy the condition

$$\Pr[\Pr(e \mid d) - \Pr(e \mid d_o) < 0.1] > 0.9$$

if the d.f.'s are given by (4.4). The additional a priori knowledge drastically reduces the number of training observations required.

For comparison with a commonly used nonparametric technique, experiments were performed using a nearest neighbor classification technique. Each x in Ω is assigned to the class represented by the nearest member in a set of n = n_1 + n_2 training observations (n_j from Class ω_j, n_1 = n_2). Using the d.f.'s of (4.4), 100 experiments were performed for each of several n values. For each experiment n observations were taken, the nearest neighbor decision rule was obtained, and Pr(e|d) computed. An estimate of the confidence that Pr(e|d) - Pr(e|d_o) < α was obtained by dividing the number of experiments for which Pr(e|d) - Pr(e|d_o) < α by 100. The curves in Figure 17 illustrate n versus α for β = 0.1, 0.2, 0.5, 0.8, 0.9. For confidence β = 0.9 that Pr(e|d) - Pr(e|d_o) < α = 0.1, Figure 17 shows that slightly more than 100 training observations are required. Figure 17 also shows that for α somewhat less than 0.1, say α = 0.05, the confidence β = 0.9 would never be attained from the nearest neighbor rule. This is in agreement with the work of Fix and Hodges [55] who show that

$$\Pr(e \mid d) \to 0.225$$

or

$$\Pr(e \mid d) - \Pr(e \mid d_o) \to 0.225 - 0.159 = 0.066$$

when the nearest neighbor procedure is used on this problem.*

*Fix and Hodges also show that for the K nearest neighbor rule, the asymptotic difference Pr(e|d) - Pr(e|d_o) decreases to zero as K → ∞.
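
The limiting value quoted for the nearest neighbor rule can be checked numerically. The sketch below uses the standard expression for the large-sample error of the single nearest neighbor rule, the integral of 2 P_1 f_1 P_2 f_2 / (P_1 f_1 + P_2 f_2), which reduces to the integral of f_1 f_2 / (f_1 + f_2) for equal a priori probabilities. The class means 0.4 and 0.6 and the midpoint-rule integration are assumptions of the sketch, not a description of the original computation.

```python
import math


def gaussian_shape(x, mean, sigma):
    """Unnormalized Gaussian; normalization is handled numerically below."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def asymptotic_nn_error(m1=0.4, m2=0.6, sigma=0.1, steps=20000):
    """Midpoint-rule estimate of the limiting nearest neighbor error on (0, 1)."""
    h = 1.0 / steps
    xs = [(k + 0.5) * h for k in range(steps)]
    g1 = [gaussian_shape(x, m1, sigma) for x in xs]
    g2 = [gaussian_shape(x, m2, sigma) for x in xs]
    k1 = 1.0 / (h * sum(g1))  # normalization constants so each d.f. integrates to 1
    k2 = 1.0 / (h * sum(g2))
    total = 0.0
    for a, b in zip(g1, g2):
        f1, f2 = k1 * a, k2 * b
        total += f1 * f2 / (f1 + f2) * h
    return total
```

The result is close to 0.225, so the limiting excess over the minimum error of about 0.159 is roughly 0.066. That excess lies between the two tolerances discussed above, α = 0.05 and α = 0.1, which is why the confidence can be reached for the larger tolerance but not for the smaller one.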
The rest of this section is concerned with the problem defined by (4.4) and (4.5) unless stated otherwise.

4.3.1 Effect of α and β

In Figure 18 Pr(e|d) is plotted versus α at the procedure's termination for several β values. The solid lines are averages over five experiments; the broken line is the maximum over five experiments for each β value and over all β values. The result should remain below the line Pr(e|d) = 0.15866 + α with confidence at least β. For this problem the procedure appears very conservative because in none of the experiments did Pr(e|d) closely approach the line Pr(e|d) = 0.15866 + α. A conservative procedure can be undesirable because the number n of training observations for domain classification may be larger than the number required with a less conservative procedure.

In Figure 19 an average value of n for five experiments is plotted versus α with β as a parameter.

4.3.2 Effect of Assumed Lipschitz Constants

The value 25 is nearly the smallest Lipschitz constant L_1 = L_2 = L that applies to the functions f_j of (4.4). Decreasing L below 25 can cause a decrease in n; however, an assumption of the problem is then violated. Nevertheless, it is interesting that for the problem of (4.4) and (4.5), reduction of L causes a reduction of n without causing Pr(e|d) to exceed an acceptable limit (Pr(e|d_o) + α).

In Figure 20 the maximum of Pr(e|d) over five experiments is plotted versus α for several values of L. Only for L = 1 and L = 2 did Pr(e|d) exceed 0.15866 + α, and even for these cases, Pr(e|d) for only one of the five experiments exceeded that value for a given α. In Figure 21, average values of n for five experiments corresponding with the examples in Figure 20 are plotted versus α with L as a parameter. The results in Figure 21 show that a priori knowledge of the smallest applicable values for the Lipschitz constants is helpful in reducing n. For the problem considered it can be concluded from Figure 20 that violation of the smallest applicable Lipschitz constants by a factor as large as ten may not prevent domain classification such that condition (1.6) is satisfied.

A reason for the good experimental results even with Lipschitz constants that are smaller than the minimum applicable values is that the maximum slope of f_j occurs in just small parts of the domain. This suggests that a priori knowledge consisting of the maximum absolute value of the slope of f_j(x) at each x ∈ Ω could be used to make the approach less conservative. Such a priori knowledge could be used to define "local" Lipschitz constants, different constants applicable for different intervals. Also suggested is the possibility of adaptively altering the constants for each interval based on current results; adaptation could occur with an operator interacting with a histogram display or automatically by an estimation procedure. Such a display or estimation procedure may require local storage of samples in order to obtain an estimate of the local rate of change of the density.

4.3.3 Effect of Signal to Noise Ratio

A commonly used measure of the separability of f_1 and f_2 when they are Gaussian* is the signal-noise ratio, S:N, which can be defined by

$$S{:}N = \left(\frac{\text{Mean}_1 - \text{Mean}_2}{\text{Standard Deviation}}\right)^2$$

In Figure 22, n averaged over five experiments is plotted versus L for S:N = 1, 2, and 4.

*When the d.f.'s are not Gaussian, this definition loses much of its appeal.

4.3.4 Effect of the Number of Intervals

In Figure 23 n averaged over five experiments is plotted versus R for α = 0.1 and α = 0.2. Not shown in Figure 23 is average n for R = 4 because the processing failed to terminate for some experiments. This failure to terminate is caused by lack of availability of a sufficient number of intervals for adjusting unclassifiable interval sizes into classifiable sizes. In the present example failure with partitions having four intervals occurs when one interval at each end of the domain is classified leaving two adjacent intervals in the middle. It is possible that neither of these intervals in the middle can be classified. The partition adjustment operations do not allow a shift of their common boundary other than through a combine operation and then a split operation. Such a pair of operations may not occur because of the conditions of Section 3.3 for combining. Even if it does occur, the resulting intervals may not be classifiable. Thus, for small R, interval operations more general than the combining and splitting described in Section 3.3 might be helpful.

The increase in average n with an increase in R for large R is explained by the worst case technique of Section 3.4 for reinitializing interval parameters after combining intervals. The technique results in a loss of information; hence, more training observations are required. Figure 23 shows experimentally that the best R for this problem is about nine or ten. R is not too critical as long as it is large enough. If it is chosen too large, the procedure automatically reduces the number of intervals actually used by combining some of them together.

4.3.5 Effect of Frequency of Computations

Let M be the number of training observations used for updating between classification attempts. In Figure 24, n averaged over five experiments is plotted versus M for several α values. Some increase in n is noted for small and for large M. Not plotted, but perhaps as significant, is the fact that for large M processing is faster because computations are performed less frequently.

4.4 Multi-Threshold Examples

To illustrate the procedure for multi-threshold problems, including non Gaussian problems, results of five experiments for each of two examples are illustrated in Figures 25 and 26 respectively.

Example 1

$$f_1(x) = \begin{cases} K_1 \exp\left[-\frac{1}{2}\left(\frac{x - 0.5}{0.1}\right)^2\right], & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

$$f_2(x) = \begin{cases} K_2 \exp\left[-\frac{1}{2}\left(\frac{x - 0.5}{0.2}\right)^2\right], & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

(K_1, K_2 are normalization constants)

$$\begin{aligned}
\alpha &= 0.2 \\
\beta &= 0.9 \\
P_j &= 0.5, \quad j = 1,2 \\
L_1 &= 25, \quad L_2 = 7 \\
R &= 13
\end{aligned}$$

Figure 25 shows the domain classification at termination, Pr(e|d), and n for each of five experiments for Example 1. Also included for comparison is the optimum domain classification and Pr(e|d_o).


Example 2

$$f_1(x) = \begin{cases} K_1 \left\{\exp\left[-\frac{1}{2}\left(\frac{x - 0.2}{0.05}\right)^2\right] + \exp\left[-\frac{1}{2}\left(\frac{x - 0.6}{0.05}\right)^2\right]\right\}, & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

$$f_2(x) = \begin{cases} K_2 \left\{\exp\left[-\frac{1}{2}\left(\frac{x - 0.4}{0.05}\right)^2\right] + \exp\left[-\frac{1}{2}\left(\frac{x - 0.8}{0.05}\right)^2\right]\right\}, & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

$$\begin{aligned}
\alpha &= 0.1 \\
\beta &= 0.9 \\
P_j &= 0.5, \quad j = 1,2 \\
L_j &= 50, \quad j = 1,2 \\
R &= 15
\end{aligned}$$

Figure 26 shows the results for Example 2. Note that for Example 2, the d.f.'s f_1 and f_2 cannot realistically be assumed Gaussian, and thus a "non Gaussian" approach, such as the current one, should be used.

Figure 25. Results for a 2 Threshold Problem

    Thresholds             Pr(e|d)
    0.3868  0.6493         0.34468
    0.3821  0.6307         0.34386
    0.3438  0.6490         0.34384
    0.3683  0.6337         0.34192
    0.3180  0.6427         0.34973

    Optimum Partition      Pr(e|d_o)
    0.3653  0.6347         0.34186

    Number of Observations (only one value legible): 712

Figure 26. Results for a 3 Threshold Problem

    Thresholds                     Pr(e|d)
    0.3024  0.4972  0.7128         0.03609
    0.3260  0.5076  0.7015         0.04232
    0.2838  0.4919  0.6908         0.03866
    0.3047  0.5025  0.7197         0.03874
    0.3027  0.4985  0.7064         0.03471

    Optimum Partition              Pr(e|d_o)
    0.3000  0.5000  0.7000         0.03413

Summary 

Computer simulations verify for the problems considered that processing according to the flow diagram of Figure 3, Chapter I gives good results. An interval of a given R-interval partition of Ω is classified if condition (2.21) is satisfied. Partition adjustments are made using the adjustment technique of Chapter III. First a one threshold problem is studied as problem parameters are varied. Of particular interest is that the procedure appears too conservative; the assumed Lipschitz constants can be reduced significantly below the minimum applicable values. The effect for the example is to reduce the number n of training observations required without increasing the probability of error above an acceptable value. The following possible modifications are suggested:

1) A priori knowledge consisting of the maximum absolute value of the slope of f_j(x) at each x ∈ Ω might be available. Such knowledge could be used to define "local" Lipschitz constants for the intervals: different constants for different intervals.

2) The Lipschitz constants for each interval could be adaptively altered, either interactively by an operator observing a histogram display or automatically by an estimation procedure. Such an approach could lead to a practical solution of the problem of obtaining the a priori knowledge required in 1) above.

It is noted that a drastic decrease in the number of training observations required can be obtained if a priori knowledge appropriate to parametric procedures is available.
CHAPTER V

EXTENSION TO MULTIDIMENSIONS

In the preceding chapters computerized recognition is restricted to a 1-dimensional observation space. A mapping is now utilized to extend the procedure to an ℓ-dimensional (ℓ finite) observation space. The observation vector x is in a bounded domain Ω of an ℓ-dimensional vector space V^ℓ where

$$\Omega = \{\mathbf{x} = (x_1, \ldots, x_\ell) : 0 \le x_j < 1,\ j = 1, \ldots, \ell\} \tag{5.1}$$

Density functions f_j, j = 1,2, defined on Ω are assumed to satisfy Lipschitz conditions

$$|f_j(\mathbf{x}) - f_j(\mathbf{y})| \le L_j \|\mathbf{x} - \mathbf{y}\|, \quad j = 1,2 \tag{5.2}$$

for the norm

$$\|\mathbf{x} - \mathbf{y}\| = \left(\sum_{i=1}^{\ell} (x_i - y_i)^2\right)^{1/2}$$

In the previous chapters, a procedure is developed for adjusting a partition of a 1-dimensional observation space. In the current chapter, an appropriately defined one-to-one mapping is utilized to achieve a correspondence between sets in a partition of Ω and sets in a partition of an interval of the real line. The mapping is defined such that the previously developed partition adjustment technique on the real line can be used to adjust the corresponding partition of Ω. Alternatively, the mapping can be viewed as converting the ℓ-dimensional problem to a 1-dimensional one.

There has been recent interest in transforming data vectors in V^ℓ to vectors in V^ℓ', ℓ' < ℓ. One such transformation is used to display clusters of ℓ-dimensional data vectors in V^ℓ', especially for the case ℓ' = 2 (a human operator then can view them for data analysis). Another type of transformation is a one-to-one map of regions in V^ℓ to intervals in V^1. The former type of transformation is discussed in Section 5.6; in its present form, it is not applicable to the partition adjustment problem although it may be possible to modify the transformation. The latter type of transformation, which will be used for adjusting the partition, is discussed in Section 5.3.

5.2 The Approach

The approach used to convert the ℓ-dimensional partition adjustment problem to a 1-dimensional problem involves the following six steps:

1) Each dimension of the domain Ω in V^ℓ is partitioned into b^K intervals where b, a positive integer, is the base for some of the arithmetic computations that follow. The positive integer K determines the number of intervals in the partition and is called the complexity. The resulting b^{Kℓ} regions in V^ℓ are referred to as elementary regions.

2) Similarly, the interval

$$\mathcal{R} = \{y : 0 \le y < 1\} \tag{5.3}$$

of the real line is partitioned into b^{Kℓ} intervals referred to as elementary intervals.

3) A one-to-one transformation is defined which maps the elementary regions onto the elementary intervals. In this manner, a partitioned ℓ-dimensional domain is mapped to a partitioned 1-dimensional domain. Data vectors falling in an ℓ-dimensional elementary region also fall in its corresponding 1-dimensional elementary interval.

4) Approximate functions h_1 and h_2 are defined for f_1 and f_2 such that h_j, j = 1,2, is constant over each of the elementary regions in Ω. The constant h_j on any particular region is taken to be the average of f_j over that region. Because f_1 and f_2 satisfy (5.2), the partitioning can be made fine enough so that for practical purposes h_j is equivalent to f_j, j = 1,2.

5) g_j, a piecewise constant function, is defined on the real line such that g_j and h_j are equal over corresponding elementary region - elementary interval pairs. The interior content or ℓ-dimensional volume of each elementary region given by

$$\text{Volume} = \left(\frac{1}{b^K}\right)^{\ell} = b^{-K\ell} \tag{5.4}$$

is equivalent numerically to the width of each elementary interval.* Thus, function g_j integrates to 1 if h_j does. Data vectors falling in the ℓ-dimensional observation space have the d.f. f_j or practically speaking h_j. These data vectors also fall on the real line where, practically speaking, they have the d.f. g_j.

6) Lipschitz conditions introducing a priori knowledge about the d.f.'s were utilized in the 1-dimensional recognition procedure considered in previous chapters. At the beginning of this chapter, (5.2) defines Lipschitz conditions assumed satisfied by the functions f_j, j = 1,2, on the ℓ-dimensional domain. The following concerns the problem of utilizing the a priori knowledge contained in these conditions in such a way that the 1-dimensional procedure may be employed with the current ℓ-dimensional problem. This involves obtaining constants L_j* to be used in defining constraints on g_j,

$$|g_j(x') - g_j(y')| \le L_j^{*}\,|x' - y'| \tag{5.5}$$

for x' ≠ y' where x' and y' are mid-points of any 2 elementary intervals in ℛ.


*The partitioning is assumed to be such that all elementary intervals are the same size and all elementary regions are the same size and shape.

Equation (5.5) can be thought of as a "pseudo-Lipschitz" condition on the function g_j. The procedure for obtaining L_j* requires that the transformation discussed in Item 3) above relates each pair of adjacent elementary intervals in ℛ with a pair of adjacent elementary regions in Ω. Then, the maximum change in g_j from any elementary interval to an adjacent elementary interval occurs when the function f_j changes at its maximum rate in the direction of the line joining the mid-points of the corresponding adjacent elementary regions (see Figure 27). The change in g_j is bounded by the relation:

$$|g_j(x') - g_j(y')| \le L_j \|\mathbf{x} - \mathbf{y}\| \tag{5.6}$$

where x and y are the mid-points of the adjacent elementary regions that correspond to the adjacent elementary intervals whose mid-points are x' and y'. Using (5.5) gives L_j* in terms of L_j:

$$L_j^{*} = L_j\, b^{K(\ell - 1)} \tag{5.7}$$
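
The arithmetic behind (5.7) can be spelled out as follows. This is a sketch under the assumption stated in the footnote, namely that all elementary regions are cubes of side b^{-K} and all elementary intervals have width b^{-Kℓ}, so that mid-points of adjacent regions are b^{-K} apart and mid-points of adjacent intervals are b^{-Kℓ} apart.

$$\begin{aligned}
\|\mathbf{x} - \mathbf{y}\| &= b^{-K}, \qquad |x' - y'| = b^{-K\ell}, \\
|g_j(x') - g_j(y')| &\le L_j\, b^{-K} = L_j\, b^{K(\ell - 1)}\, b^{-K\ell} = L_j\, b^{K(\ell - 1)}\, |x' - y'|, \\
L_j^{*} &= L_j\, b^{K(\ell - 1)} .
\end{aligned}$$

For instance, with b = 3, K = 2, ℓ = 2 and L_j = 25, the converted constant is 25 × 9 = 225. The constant grows geometrically with the complexity K and with the dimension ℓ, which suggests that the mapped problem can require many more training observations than a 1-dimensional problem with the same original Lipschitz constant.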


Recall that the functions g_j are, for practical purposes, d.f.'s governing the 1-dimensional mapped observations. Treating the constants L_j* as Lipschitz constants for functions g_j, one can use the 1-dimensional recognition procedure developed in previous chapters. It operates on the mapped training observations to obtain a solution in ℛ. The solution consists of a partition of ℛ with each interval assigned to one class or the other. The ℓ-dimensional solution can be obtained by assigning each elementary region in Ω to the class assigned to its corresponding elementary interval in ℛ.

One could avoid the conversion in (5.7) above by treating the 1-dimensional mapped training observations as though they were the original data and assuming Lipschitz conditions on d.f.'s for this data. Such assumed Lipschitz conditions are open to question, but, in practice, so are the ones given by (5.2) on the original functions.

Figure 27. The Functions f_j, h_j, and g_j


The transformation to be discussed in Section 5.3 does not depend on the data; transformations that are data dependent could profitably be used. For example the correspondence between elementary regions and elementary intervals could be defined to minimize increments in the estimated d.f.'s between adjacent elementary intervals.

Additional improvement could be attained by subdividing the observation space by a clustering [23, 56] or other technique to isolate modes of the d.f.'s. Each subdivision can then be treated as the domain of a separate problem for subsequent partitioning and mapping to the real line.


5.3 Mapping to One Dimension

This section describes mappings that map elementary regions in Ω one-to-one onto elementary intervals in ℛ. First, the elementary regions and elementary intervals are defined more clearly.

The ℓ-dimensional elementary region S_{e_1,e_2,...,e_ℓ}(b,K,ℓ) is defined by

$$\begin{aligned}
S_{e_1, e_2, \ldots, e_\ell}(b,K,\ell) &= \left\{\mathbf{x} : \frac{e_j}{b^K} \le x_j < \frac{e_j + 1}{b^K},\ j = 1,2,\ldots,\ell\right\} \\
&= \left\{\mathbf{x} : e_j \le b^K x_j < e_j + 1,\ j = 1,2,\ldots,\ell\right\}
\end{aligned} \tag{5.8}$$

All b^{Kℓ} elementary regions are defined by allowing each of the subscripts e_j, j = 1, ..., ℓ, in (5.8) to take on each of the values 0, 1, 2, ..., b^K - 1. S_{e_1,e_2,...,e_ℓ}(b,K,ℓ) is the set of all x in Ω that become identical if each of the ℓ components is expressed in the base b number system and truncated to K digits.

Similarly the elementary interval S_e(b,Kℓ) is defined by

$$\begin{aligned}
S_e(b,K\ell) &= \left\{y : \frac{e}{b^{K\ell}} \le y < \frac{e + 1}{b^{K\ell}}\right\} \\
&= \left\{y : e \le b^{K\ell} y < e + 1\right\}
\end{aligned} \tag{5.9}$$

All b^{Kℓ} elementary intervals are defined by allowing the subscript e in (5.9) to take on each of the values 0, 1, 2, ..., b^{Kℓ} - 1. S_e(b,Kℓ) is the set of all y in ℛ that become identical if expressed in the base b number system and truncated to Kℓ digits.

Both the elementary regions and the elementary intervals are uniquely identified by their subscripts. Hence the mappings can be defined via the subscripts.


5.3.1 The Dovetail Mapping

Consider mapping the arbitrary elementary region S_{e_1,e_2,...,e_ℓ}(b,K,ℓ) to an elementary interval in ℛ. The base b representation of the subscript e_j, j = 1, ..., ℓ, is

$$e_j = \alpha_{j1}\alpha_{j2}\cdots\alpha_{jK} = \alpha_{j1} b^{K-1} + \alpha_{j2} b^{K-2} + \cdots + \alpha_{jK} b^{0} \tag{5.10}$$

where each α_ji, i = 1, ..., K, is one of the values 0, 1, ..., b - 1. The Dovetail Mapping defines the corresponding subscript e by

$$\begin{aligned}
e &= \alpha_{11}\alpha_{21}\cdots\alpha_{\ell 1}\,\alpha_{12}\alpha_{22}\cdots\alpha_{\ell 2}\,\cdots\,\alpha_{1K}\alpha_{2K}\cdots\alpha_{\ell K} \\
&= \alpha_{11} b^{K\ell - 1} + \alpha_{21} b^{K\ell - 2} + \cdots + \alpha_{\ell K} b^{0}
\end{aligned} \tag{5.11}$$

This mapping has been used in integration theory (see e.g. Wiener [41]). It is called the Dovetail Mapping because it interleaves or dovetails the digits of the e_j's to get e.

Example. Figure 28 illustrates an example with b = 3, K = 2, and ℓ = 2. The ordering imposed on the elementary regions in Ω through the Dovetail Mapping by the natural ordering of the elementary intervals in ℛ is illustrated with an ordering path. The dotted line portions of the ordering path denote discontinuities in the path. Several corresponding elementary regions in Ω and elementary intervals in ℛ have been labeled with corresponding letters in the Figure.
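
The digit interleaving of the Dovetail Mapping can be sketched in a few lines. The helper names below are illustrative; the sketch assumes each region subscript is obtained by truncating a coordinate to K base b digits, as in (5.8), and that the digits are interleaved most significant first, as in (5.11).

```python
def to_digits(value, base, count):
    """Base-`base` digits of `value`, most significant first, padded to `count`."""
    digits = []
    for _ in range(count):
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits[::-1]


def region_subscripts(x, base, K):
    """Subscripts e_j of the elementary region containing the point x in [0, 1)^l."""
    return [int(base ** K * xj) for xj in x]


def dovetail(subscripts, base, K):
    """Interleave the digits of the region subscripts into one interval subscript."""
    digit_rows = [to_digits(e_j, base, K) for e_j in subscripts]
    e = 0
    for i in range(K):               # digit position, most significant first
        for row in digit_rows:       # one digit from each dimension in turn
            e = e * base + row[i]
    return e
```

For b = 3 and K = 2, the point (0.5, 0.25) lies in the region with subscripts (4, 2), whose digits 11 and 02 interleave to 1012 in base 3, that is, e = 32. Consecutive values of e can belong to regions that are far apart, which is the discontinuity in the ordering path that the example above describes.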

5.3.2 The Column Mapping

A problem with the Dovetail Mapping is the discontinuous way in which it orders the elementary regions in Ω by the natural ordering of elementary intervals in ℛ. For example, in Figure 28, adjacent elementary intervals F_1 and G_1 in ℛ correspond to the widely separated elementary regions F_2 and G_2 in Ω. Because of the discontinuities, (5.7) cannot be used to obtain constants L_j* for pseudo-Lipschitz conditions on g_j.

It is possible to modify the Dovetail Mapping to remove the discontinuities in its ordering path. By inverting the ordering of the elementary regions in suitably defined regions of Ω, the mapping illustrated geometrically by Figure 28, for example, can be converted to the mapping illustrated by Figure 29. Note that the ordering path in Figure 29 is continuous, which implies that each pair of adjacent elementary intervals in ℛ corresponds to a pair of adjacent elementary regions in Ω. The mapping illustrated in Figure 29 is an example of what is called here a "Column Mapping". Whereas b for the Dovetail Mapping can be any positive integer, it is restricted to be an odd positive integer for the Column Mapping.

Figure 29. Column Mapping for b = 3, K = 2, and ℓ = 2

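Again (5.10) is used to represent the subscripts e_j for the elementary region S_{e_1,e_2,...,e_ℓ}(b,K,ℓ). The Column Mapping is defined algebraically by writing,

$$e = \beta_{11}\beta_{21}\cdots\beta_{\ell 1}\,\beta_{12}\beta_{22}\cdots\beta_{\ell 2}\,\cdots\,\beta_{1K}\beta_{2K}\cdots\beta_{\ell K} \tag{5.12}$$

Each β_ji is either equal to α_ji or b - 1 - α_ji. Which it is depends on whether or not an ordering inversion as mentioned above is required for the region defined by those digits in (5.11) that are more significant than α_ji. A method for determining whether an order inversion should be made is now given.

Define

$$Q_{ji} = \sum_{m=0}^{i-1}\sum_{n=1}^{\ell} \alpha_{nm} - \sum_{m=0}^{i-1} \alpha_{jm} + \sum_{n=0}^{j-1} \alpha_{ni}, \qquad \text{where } \alpha_{n0} = \alpha_{0i} = 0 \tag{5.13}$$

Q_ji is the sum of all α's in the following blocked in portions of base b representations of the e_j's:

$$\begin{array}{ccccc}
\alpha_{11} & \cdots & \alpha_{1i} & \cdots & \alpha_{1K} \\
\vdots & & \vdots & & \vdots \\
\alpha_{j1} & \cdots & \alpha_{ji} & \cdots & \alpha_{jK} \\
\vdots & & \vdots & & \vdots \\
\alpha_{\ell 1} & \cdots & \alpha_{\ell i} & \cdots & \alpha_{\ell K}
\end{array}$$

(the blocked in portions are the digits α_nm with m < i and n ≠ j, together with the digits α_ni with n < j).

Then, β_ji is obtained from:

$$\beta_{ji} = \begin{cases} \alpha_{ji} & \text{if } Q_{ji} \text{ is even} \\ b - 1 - \alpha_{ji} & \text{if } Q_{ji} \text{ is odd} \end{cases} \tag{5.14}$$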

The Column Mapping defines an ordering path that orders the elementary regions in Ω in exactly the same way as a curve in the sequence of curves defined by Moore [42], who shows for the 2-dimensional case that the limit curve is a space-filling curve (Peano [43]).

Example. Figure 29 illustrates the Column Mapping for b = 3, K = 2, ℓ = 2. It shows the ordering path and pairs of corresponding sets (labeled with corresponding letters). By unbending the ordering path and carrying along with it the elementary regions in Ω through which it passes, the elementary regions are strung out in a line or column, hence the name Column Mapping.

The fact that a Column Mapping ordering path is continuous ensures that there is an adjacent pair of elementary regions in Ω corresponding to each adjacent pair of elementary intervals in ℛ. The term "quasi-continuous"* is adopted to describe this property of the mapping (actually a property of the inverse mapping). It is this property that was required in order to convert the Lipschitz constants by using (5.7).

5.3.3  Other  Mappings  with  the  Quasi-Continuity  Property 

Other mappings having the quasi-continuity property can be
defined. For example, the elementary regions can be ordered according
to a curve in a sequence of curves giving the Hilbert realization [45]
of a space-filling curve. Figure 30 illustrates an ordering path
that could result. In a recent paper [53] Butz has defined the Hilbert
Curve Mapping algebraically for ℓ dimensions.

Starting  with  the  Column  Mapping,  the  dimensions  can  be  ordered 
differently  in  different  regions  giving,  for  example,  the  mapping 
illustrated  by  Figure  31.  For  later  reference,  the  mapping  of  Figure  31 
is  called  a "modified  Column  Mapping." 

5.3.4  A Mapping  Criterion 

Several mappings have been discussed. Of these, the Dovetail
Mapping does not have the quasi-continuity property and is not
considered further. A criterion is now suggested for use in
determining which of the mappings with the quasi-continuity property
is most appropriate.

*Butz [44] uses the term "quasi-continuous" in a similar context.

The one-dimensional partition adjustment technique of Chapter IV
automatically combines contiguous groups of elementary intervals in
the one-dimensional domain. Because of the quasi-continuity property
of the mappings now considered, the corresponding elementary regions
in the ℓ-dimensional domain, when combined, form a region that is
contiguous. The d.f.'s f_1 and f_2 are approximated by constants over
each such combined region. One would expect these constant
approximations to be the most accurate when each possible combined
region is tightly knit together. A reasonable approach, then, is to
seek the mapping that minimizes the maximum value of the ratio

$$\Omega = \frac{\left(\text{maximum length of the combined } \ell\text{-dimensional region in any coordinate direction}\right)^{\ell}}{\ell\text{-dimensional volume of the combined } \ell\text{-dimensional region}} \qquad (5.15)$$

for any possible combined ℓ-dimensional region. For the Column
Mapping, Ω satisfies

$$\Omega \le (2b)^{\ell-1} \qquad (5.16)$$

indicating, for a given dimensionality ℓ, that Ω is independent of
K but that the base b should be chosen as small as possible. The
smallest nontrivial odd base is 3 (recall that, for the Column
Mapping, b must be odd). For this reason only base 3 is considered
further for use with the Column Mapping. With b = 3 and ℓ = 2 in
(5.16), Ω is bounded by 6. The worst case for the Hilbert Curve
Mapping of Figure 30 would be the case in which four elementary
regions in a line are combined. Ω from (5.15) is then bounded by 4.
Similarly, Ω for the Modified Column Mapping of Figure 31 is bounded
by the value 5.4.

The examples discussed later in this chapter all use the Column Mapping.
However, it is apparent that, based on the ratio Ω, the Hilbert
Curve Mapping and the Modified Column Mapping merit further study.
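The criterion can be checked numerically for a small case. The sketch below is an illustration under stated assumptions, not part of the original work. It reads the ratio of (5.15) as (largest coordinate extent)^ℓ divided by the number of elementary regions combined, with lengths measured in elementary-region widths, and it uses an assumed reading of the Column Mapping digit rule (each digit is complemented when the sum of the more significant digits of the other subscripts is odd). The names `column_index` and `worst_ratio` are hypothetical.

```python
from itertools import product

def column_index(e, b, K):
    """Column Mapping index of region e under the assumed parity rule."""
    ell = len(e)
    alpha = [[(ej // b ** (K - 1 - i)) % b for i in range(K)] for ej in e]
    d = 0
    for i in range(K):
        for j in range(ell):
            q = sum(alpha[n][m] for m in range(i) for n in range(ell) if n != j)
            q += sum(alpha[n][i] for n in range(j))
            digit = alpha[j][i] if q % 2 == 0 else b - 1 - alpha[j][i]
            d = d * b + digit
    return d

def worst_ratio(b, K, ell):
    """Largest ratio over every contiguous run of elementary intervals."""
    side = b ** K
    order = sorted(product(range(side), repeat=ell),
                   key=lambda e: column_index(e, b, K))
    worst = 0.0
    for start in range(len(order)):
        lo, hi = list(order[start]), list(order[start])
        for stop in range(start, len(order)):
            cell = order[stop]
            for a in range(ell):     # grow the bounding box of the run
                lo[a] = min(lo[a], cell[a])
                hi[a] = max(hi[a], cell[a])
            length = max(h - l + 1 for l, h in zip(lo, hi))
            volume = stop - start + 1
            worst = max(worst, length ** ell / volume)
    return worst

if __name__ == "__main__":
    print(worst_ratio(3, 2, 2))      # compare with the bound (2b)**(ell - 1)
```

The search is exhaustive, so it is practical only for small b, K, and ℓ. It is meant as a sanity check on the bound, not as a way to choose a mapping.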

5.4 Computer Simulated Results

To demonstrate the extension to multidimensions, several
two-dimensional examples using the Column Mapping are presented. The
mapping parameters for each example are b = 3, K = 3, and ℓ = 2,
giving the ordering path illustrated in Figure 32.

5.4.1  Examples 

The examples all use class-conditional d.f.'s that are either
Gaussian or linear combinations of Gaussian d.f.'s. Though the procedure
does not require Gaussian data, such data is easy to generate on the
computer and, with linear combinations of Gaussian d.f.'s, is felt to
represent, as well as any data, the type of problems to be handled.

Table 1 lists the weighting coefficients, means and covariance matrices
used for the components of the linear combination in each example. Any
observation falling outside the ℓ-dimensional domain is rejected and a
new one obtained. This truncation effect is minimal for the examples
because of the placement of the component d.f.'s well within the
boundaries of the domain. A priori probabilities are assumed to be 0.5.
The goal is to satisfy condition (1.6) with a specified constant of 0.1
and a specified confidence of 0.9. Figures 33 through 42 illustrate the
resulting assignments of regions to the two classes. A circle with
radius one standard deviation is drawn about the center of each
component to facilitate visualization of the results.

Table 1. Definition of Examples

[Table columns: Example; Class; Mode with weighting coefficient; Means;
Covariance Matrices. Entries legible in the source: weighting
coefficients of 0.5 for Mode 1 and Mode 2 of the two-component classes;
mean components 0.394, 0.606, 0.35, 0.65, 0.25, and 0.75; diagonal
covariance matrices with both variances equal to 0.01, 0.0025, or 0.04
and zero covariance.]

Figures 33 through 36 for Examples 1 through 4 show that
the procedure succeeds for different arrangements of the
class-conditional d.f.'s. Separation of the means in each case is three
times the standard deviation. Figures 37 and 38 for Examples 5
and 6 illustrate the capability of the procedure to separate the
space based solely on the dispersion of the distributions.

Figures 39 through 42 for Examples 7 through 10 portray some
bimodal results for cases in which regions assigned to the two
classes are interleaved.

Similar to the one-dimensional case, it is found that significantly
fewer training observations are required when smaller Lipschitz
constants are used. For the examples illustrated, the constants used
are smaller than those obtained via (5.7) from the smallest applicable
Lipschitz constants for the d.f.'s. The effect on the boundary of using
the smaller constants is not serious for these examples. However, it
must be noted that an assumption of the problem is violated, and
attainment of the goal is not verified. With large constants L_j, the
solution is generally obtained only after very many training
observations; however, it is observed that tentative classification of
the domain usually settles down quickly to a reasonable result.

Thus, another way to make the procedure more nearly practical is to
use the tentative results when results are needed but to allow
the system to continually process additional incoming training
observations until the final solution is obtained. If large
enough constants have been used, then the final result can
be trusted.

For these examples, the procedure terminates after approximately
1000 training observations. The maximum number of intervals
allowed is 15 for the unimodal d.f. problems and 20 for those
with bimodal d.f.'s.


5.4.2 Computational Aspects

As the complexity of the problem increases, that is, as the
Kℓ product increases, some computational problems appear. For
example, the IBM 1130 computer system used for the examples
maintains accuracy to about five significant decimal digits. So
long as the real number representation of an interval boundary*
needs no more than five decimal digits accuracy, the ordinary
arithmetic operations and storage techniques provided with the
computer system can be employed. Five decimal digits corresponds
roughly to ten ternary digits; thus, if a mapping using base 3
is employed, the Kℓ product is limited to about ten with ordinary
operations of the IBM 1130 system. This corresponds at one
extreme to a ten-dimensional problem with each dimension
partitioned into three intervals, and at the other extreme, to a
two-dimensional problem with each dimension partitioned into 243
intervals. For either case (or any intermediate case) an interval
boundary (corresponding to a mapped region boundary) can be
stored as an ordinary real number.

*An interval boundary is identified by the elementary interval
immediately to its right.
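The digit-count arithmetic behind this limit can be reproduced with a short sketch. The helper name `max_digit_product` is hypothetical, and the calculation simply assumes that every base-b digit string of length Kℓ must be distinguishable within the stated number of decimal digits.

```python
import math

def max_digit_product(decimal_digits, base):
    """Largest K*l for which base**(K*l) does not exceed 10**decimal_digits."""
    return math.floor(decimal_digits * math.log(10) / math.log(base))

limit = max_digit_product(5, 3)        # ternary digits available
intervals_10d = 3 ** (limit // 10)     # intervals per dimension when l = 10
intervals_2d = 3 ** (limit // 2)       # intervals per dimension when l = 2

if __name__ == "__main__":
    print(limit, intervals_10d, intervals_2d)
```

With five decimal digits and base 3 this gives ten ternary digits, three intervals per dimension in ten dimensions, and 243 intervals per dimension in two dimensions, matching the figures quoted in the text.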

When Kℓ > 10, other techniques must be employed. One approach
is to employ a computer system with more storage in each computer
word; however, at some critical Kℓ product for a given base b,
the problem reappears. Another approach is to provide for the
storage of each boundary in several words of storage. Such
extended precision requires programs to handle the arithmetic
operations involved. Increased computer time as well as increased
storage (for the multiword interval boundaries) results. For
the examples handled in this report, one word per boundary is
used. All variables and the entire program are contained in
the 16000-word (16-bit words) main storage of the IBM 1130 computer
system. Processing time for each of the two-dimensional examples
is approximately five minutes.

5.5 Other Uses for the Mappings

This  section  briefly  discusses  other  uses  for  the  mappings 
described  in  Section  5.3. 

5.5.1 Display of Real-Valued Functions

A real-valued function of more than one real variable is
difficult to observe. The two-dimensional display surfaces
generally used have the capability of displaying such a function
defined on no more than one variable. When the domain is greater
than one-dimensional, various projections and sectional views
can be used to gain a perspective of the function. Another
approach is to first map the multidimensional domain to one
dimension via one of the mappings described in Section 5.3. Then
the function's one-dimensional equivalent can be displayed on a
two-dimensional surface [46]. Figure 43 illustrates the resulting
display for a bivariate Gaussian d.f. given by

$$f(x) = \frac{1}{2\pi(0.25)^2}\exp\left[-\frac{1}{2}\sum_{i=1}^{2}\left(\frac{x_i - 0.5}{0.25}\right)^2\right]$$

where the mapping used is the Column Mapping with b = 3, K = 3,
and ℓ = 2.

Unless one is accustomed to observing bivariate Gaussian
d.f.'s in the form displayed by Figure 43, the function represented
there probably is unrecognizable as a transformed Gaussian d.f.
For purposes of recognizing functions, the display has little
value. It is for purposes of comparing functions that such a
display can profitably be used. One application is to display
the difference of two d.f.'s in order to get an idea of what has
been called the "separability" of the two functions. Two d.f.'s
are highly separable if the d.f. generating an observation can
be identified from the observation's location in the ℓ-dimensional
domain with small probability of error.

5.5.2 Parameter Sensitivity Studies

A display such as described in 5.5.1 can be used for parameter
sensitivity studies. Suppose a system has a real-valued function
output depending on M real variables α_1, α_2, ..., α_M, as illustrated
in Figure 44.

Figure 44. Study of a System's Input Parameters

Suppose that a known setting of these parameters produces a
desired function f. A problem is to adjust for a
cheaper set of parameters without significantly degrading the
function. The difference between the desired output and the
output with adjusted parameters can be continually monitored
as the parameters are adjusted. Such a use might not require
the resolution of the difference function to be very large.
In that case the result might be mapped back up to two dimensions
via the inverse* of a mapping with the "quasi-continuity" property,
and the difference function displayed as intensity modulation.

The use of a color display can further enhance the usefulness
of the mappings if quick interaction from the operator is desired.
For example, while intensity is used to portray the difference
function, color can be used to identify the region in the
ℓ-dimensional domain corresponding to any point on the display.
Although this information is already available from the location
of the point on the display, color enables its determination to be
made more quickly. If particular regions in the domain are of
interest, different colors can be reserved for use at their
corresponding display points.

For another example consider the problem of representation
of a d.f. as a linear combination of Gaussian d.f.'s. Suppose
that an acceptable representation (perhaps from a histogram or
some other estimation procedure) has been obtained, but that
the number of parameters used is impracticably large. The
difference between the acceptable representation and the linear
combination of Gaussian d.f.'s can be mapped to the real line.
A viewer controlling the parameters describing the linear
representation can interact with the displayed result to find a set
of parameters giving suitable agreement between the two
representations. The representation problem just described also
occurs in unsupervised estimation problems.

*The same formula for the inverse can be used as for the mapping
itself. That is, the β's derived from the α's per (5.13) and
(5.14) can themselves be processed by (5.13) and (5.14) as if
they were the α's to get γ's. The resulting γ's are the original
α's.

5.5.3 Data Reduction

Many data reduction schemes operate on real-valued functions
of one real variable. They use techniques to reduce the redundancy
in the function so that it can be represented with as few parameters
as possible. The case in which the domain is multidimensional
can be handled by mapping the domain to one dimension. For
example, the television camera with its raster scan reduces a
function of intensity on two dimensions to a function on one
dimension*. Because each line in the raster traverses from
one side of the picture to the other, the function cannot
generally be well approximated by a constant for the length of the
line. However, if the line were to wander around in a more or
less tightly knit region such that the same area of the picture
is covered, it is reasonable to assume that fewer changes in
intensity will be encountered and hence a better chance for a
satisfactory constant approximation exists. Using the same
argument throughout the space leads to the conclusion that a
mapping such as the Column Mapping can give a function of one
variable that is generally characterizable with fewer parameters
(at least when using a piecewise constant representation) than
the ordinary raster scan type mapping. Hence, such mappings can
be considered for use with data reduction schemes. Abend, Harley,
and Kanal [47] have considered the Hilbert Curve Mapping to
account for spatial dependencies of random variables along the
ordering path.

*Note that the conventional television raster scan is, except for
the interleaving feature, a special case of the Dovetail Mapping
of Section 5.3.1.

For purposes of data reduction, it may be advantageous to
alter the mappings in a way that depends on the data; that is,
to make the mappings interactive with the data. For example,
instead of modifying the Column Mapping of Figure 29 by
reordering the dimensions in different regions to obtain Figure 31,
the ordering of the dimensions in a region could be made to depend
on the function in that region. For the two-dimensional case
(pictures), both orderings of the dimensions can be considered,
and the one best satisfying some criterion, e.g. smoothness of
the function along the resulting ordering path, can be chosen
for the mapping.

5.5.4  Scanning  for  Regions  with  Specified  Function  Values 

Butz [44] considers what he calls a "Finite Peano Mapping"
which is essentially the inverse to the Column Mapping for
base 3. From knowledge of properties of a function f defined
on the ℓ-dimensional domain, he searches for regions satisfying
f(x) < 0. Butz derives numerical bounds describing the
quasi-continuity of the mapping. Then, from properties assumed
satisfied by function f, an "implicitly exhaustive search" procedure
can be used to find regions in the domain for which f(x) < 0. Butz
calls the search implicitly exhaustive because every point is
accounted for without making computations at every point.

5.6 Other Extensions to Multidimensions

All the described mappings map regions in the ℓ-dimensional
domain one-to-one onto intervals in the one-dimensional domain.
Such mappings provide a way to extend the recognition techniques
of previous chapters to multidimensions. In addition, they have
other uses as discussed in Section 5.5.

Other mappings with the general purpose of reducing the
dimensionality of data vectors have been defined in the
literature. When the dimensionality is reduced to one, it is
reasonable to consider these mappings as the means to extend the
current work to multidimensions.

Mappings that operate only on the observations have been
considered by Shepard and Carroll [48]. They map the set of n
ℓ-dimensional observations to a set of n ℓ'-dimensional
observations, where ℓ' < ℓ. They strive to obtain this
mapping so that an index (5.17) that measures continuity inversely
is minimized. In (5.17), the two distances are those between the
i-th and j-th observations as measured in the ℓ-dimensional and the
ℓ'-dimensional spaces, respectively, where y_ik is the k-th
component of observation y_i = (y_i1, ..., y_iℓ) and x_ik is the
k-th component of observation x_i = (x_i1, ..., x_iℓ'). In (5.17),
the weight w_ij is included to weigh the effect of the relation
between the i-th and j-th observations less as the mapped distance
between them is increased. The denominator C given by (5.20)
is included for normalization purposes. Without it, the index could
be made as small as desired by making each mapped distance large.
References [49,50] also consider mappings of this type.

Another mapping that maps only the observations is the
"Chain Mapping" [23]. The Chain Mapping considers the
observations sequentially; the next member in the sequence is the
nearest neighbor to the current observation. An observation
is mapped from the ℓ-dimensional space to the real line such that
the distance to the previous member in the sequence is preserved.
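The verbal description of the Chain Mapping can be turned into a short sketch. This is not the algorithm of [23], whose details are not given here. It assumes that the chain starts at the first observation, that "nearest neighbor" means the nearest observation not yet in the chain, and that each new member is placed to the right of the previous one. The function name `chain_mapping` is hypothetical.

```python
import math

def chain_mapping(observations):
    """Map observations to the real line along a nearest-neighbor chain."""
    remaining = list(range(1, len(observations)))
    current = 0                  # assumption: the chain starts at observation 0
    positions = {0: 0.0}         # assumption: the first member maps to 0
    while remaining:
        # Next member: nearest observation not yet placed in the chain.
        nxt = min(remaining,
                  key=lambda k: math.dist(observations[current], observations[k]))
        step = math.dist(observations[current], observations[nxt])
        positions[nxt] = positions[current] + step   # preserve this distance
        remaining.remove(nxt)
        current = nxt
    return [positions[k] for k in range(len(observations))]

if __name__ == "__main__":
    print(chain_mapping([(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]))
```

Only the distance between consecutive chain members is preserved. Distances between other pairs are generally distorted, which is why clustering information survives only when it is reflected in the chain order.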

An important use for mappings that operate only on the
observations is to reduce their dimensionality so that they
can be displayed. If preserved by the mapping, clustering
information and other relations among the data can be learned
visually from the display. Because of the inability to observe
the data in the original multidimensional space, these relations
could go unnoticed without the mapping. Applications include
problems in radar and sonar. For example, signals from targets
can be converted to ℓ-dimensional vectors, mapped to lower
dimensional vectors, and observed. If the mapping has preserved
cluster relationships, it may be possible to separate the data
into two groups. Naming one group warheads and the other decoys
could occur with additional information such as knowledge of
the ratio of warheads to decoys.

A disadvantage of such mappings for the current work is
the fact that they map only the observations and do not treat
the rest of the space. The mapping of additional observations
is handled by reprocessing the whole set with the additional
ones appended. A way to avoid this problem would be to process
just once an appropriately sized subset of the observations.
The mapping at other points in the space could be defined by
using an interpolation procedure. For example, a vector could
be mapped to the real line so that the ratio of distances from
the vector to its two nearest neighbors in the ℓ-dimensional
space is preserved and so that the mapped vector lies between
the mapped nearest neighbors. The approach can be used for
the current work but requires assumed Lipschitz conditions on
d.f.'s for the mapped observations.

Another mapping that can be used is one proposed by Patrick
and Fischer [51]. It is a linear transformation from the
ℓ-dimensional to the ℓ'-dimensional space. The transformation
is chosen to maximize a measure of separability between an
estimated d.f. on transformed Class 1 observations and an
estimated d.f. on transformed Class 2 observations. The d.f.
estimates are of the Parzen [11] type. The measure of
separability between these estimates is defined to be the square
root of the integral of their difference squared. This mapping
resembles the mapping proposed by Shepard and Carroll provided
their mapping is first extended to the whole domain via
an interpolation procedure. Both approaches depend only on the
original training observations. Important differences are that
Patrick and Fischer's mapping is linear and maximizes a measure
of separability, whereas Shepard and Carroll's mapping is
nonlinear and minimizes an index that measures continuity inversely.

CHAPTER VI
CONCLUSIONS


6.1 Summary of Results

A procedure is described for determining a decision rule d for
the ℓ-dimensional, 2 class, nonparametric recognition problem
with unknown class-conditional density functions. A priori
probabilities are known, and the density functions are assumed
to satisfy Lipschitz conditions with known Lipschitz constants.
The procedure allows the achievement of a specified confidence
that the probability of a recognition error when using d is within
a specified constant of the minimum attainable probability of
recognition error. A fixed storage constraint is imposed.

Histogram estimates of the unknown density functions using
an R-interval partition of the domain are obtained from a sequence
of training observations. These estimates are used to define
the decision rule d. The specified confidence is achieved by
achieving a similar confidence for each interval in the partition.
During training, the partition (always restricted to R intervals
or less) is altered in an effort to improve a measure of performance.
Histogram estimation of the density functions proceeds based on the
new partition but makes use of information obtained while using the
old one. A proposition presents requirements to achieve a specified
confidence for a fixed partition; however, there are no theoretical
results showing achievement of the confidence when the partition
is adjusted. Experimental results for several one-dimensional
examples are presented that demonstrate achievement of the desired
confidence when the partition is adjusted with R selected somewhat
larger than the number of decision thresholds. These experimental
results use training observations from densities that are linear
combinations of Gaussian densities. The results indicate that
Lipschitz constants smaller than the minimum applicable values
give improved performance (a decrease in the number of training
observations required without increasing the probability of error
above acceptable limits). The explanation is that the density
functions of the examples satisfy Lipschitz conditions with smaller
constants in some intervals than in others. This suggests the
possibility of supplying different Lipschitz constants for different
intervals, perhaps through an operator interacting with a histogram
display or automatically by an estimation technique. The recognition
results would then be based on assumptions of the density functions
satisfying "local" Lipschitz conditions with the supplied constants.

Extension of the procedure to ℓ-dimensional observation vectors
is achieved using a transformation; this transformation maps
elementary regions in a partition of the ℓ-dimensional observation
space one-to-one onto elementary intervals in a partition of a
one-dimensional domain. One-dimensional mapped versions of the
ℓ-dimensional training observations are then used in the
one-dimensional procedure.

The recognition procedure may have engineering application to
problems for which storage is limited but many training observations
are available. One possible example is a recognition system built
for long-life space vehicles, which are weight limited and hence
storage limited.

6.2 Extensions

Assumed Lipschitz conditions on the density functions f_j allow
bands of uncertainty to be placed about the averages of the functions
in each interval. The bands are statistically described by
distributions on the averages, where the distributions are obtained
from training observations. The classification procedure of Chapter II
uses these statistically described bands.

Statistically described bands of uncertainty can be obtained
using a priori knowledge other than Lipschitz conditions on the
density functions. For example, one could directly assume bands of
uncertainty about the averages in each interval. More generally,
one could assume bands of uncertainty about an approximation of f_j
in an interval by a sum of terms, indexed from t = 1, involving more
terms than just the average of f_j. The functions in the sum would,
for simplicity, be orthonormal. The bands of uncertainty would be
described statistically by distributions on the parameters {c_jt},
where the distributions are obtained from the training observations.

A priori knowledge consisting of bounds S_j on the variation of
the d.f.'s has been considered. The variation is intuitively appealing
because it provides a measure of the absolute value of the derivative
averaged over the domain. In addition, such knowledge allows for
discontinuous d.f.'s. Instead of computing the band of uncertainty by
$|\bar{f}_j - f_j(x)| \le L_j W/2$ for an interval, the band can be
computed by $|\bar{f}_j - f_j(x)| \le S_j$. For example, d.f.'s used
in the experimental work of Chapter IV have maximum derivatives equal
to 25 but variation equal to 8. However, the band of uncertainty
computed from the variation does not decrease with interval width as
required for the classification of some intervals. Thus the procedure
could not in general use bounds on the variation of f_j for
classification of all intervals. Such bounds could profitably be used
in those intervals for which it is known that S_j < L_j W/2.
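The trade-off between the two bands can be illustrated numerically with the constants quoted above (maximum derivative 25, variation 8). The sketch below is an illustration only. It assumes that the band for an interval of width W is taken as the smaller of the Lipschitz band L W/2 and the variation bound S, and the function names are hypothetical.

```python
def lipschitz_band(L, W):
    """Half-width of the band implied by a Lipschitz constant L over width W."""
    return L * W / 2.0

def uncertainty_band(L, S, W):
    """Assumed combined band: the tighter of the Lipschitz and variation bounds."""
    return min(lipschitz_band(L, W), S)

L_const, S_bound = 25.0, 8.0
crossover_width = 2.0 * S_bound / L_const   # width at which L*W/2 equals S

if __name__ == "__main__":
    for W in (0.1, 0.5, crossover_width, 1.0):
        print(W, lipschitz_band(L_const, W), uncertainty_band(L_const, S_bound, W))
```

Under these assumptions the variation bound is the tighter one only for intervals wider than 0.64. For narrower intervals the Lipschitz band shrinks with W while the variation bound does not, which is the limitation noted in the text.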

The recognition procedure is extended to ℓ dimensions via a
transformation that essentially converts the ℓ-dimensional problem
into a one-dimensional problem. The transformation approach is
desirable because

1)  The one-dimensional techniques can be used.

2)  A partition of the ℓ-dimensional domain can be altered by
altering the corresponding partition of a one-dimensional
domain.

3)  The bookkeeping operation involved in storing the partition
is simplified by storing the equivalent one-dimensional
partition.


Disadvantages of the approach for handling the ℓ-dimensional problem
and suggestions for their relief are

a)  A partition of the ℓ-dimensional domain has definite
restrictions imposed by the transformation. It is desirable
that the transformation tends to form partitions with
tightly knit regions. The Hilbert Curve Mapping illustrated
in Chapter V for two dimensions is better in this regard
than the transformation used. The implementation of the
Hilbert Curve Mapping should thus give improved results.

b)  The act of transforming the problem to one dimension
causes neighborhood information between neighboring
observations in ℓ dimensions to be lost. The effect is that
more training observations are required than if the
solution were carried out solely in ℓ dimensions. A way
to decrease this effect is to account for the neighborhood
information before performing the transformation. For
example, a cluster of observations could be placed about
each training observation. Their mapped equivalents in
one dimension, if treated as mapped training observations,
carry the neighborhood information with them. This
operation can be interpreted as smoothing the data before
mapping. Another solution is to carry out the entire
analysis in ℓ dimensions; this involves developing
ℓ-dimensional procedures for use with regions in a partition
of the ℓ-dimensional domain. Techniques for an ℓ-dimensional
region would be similar to the techniques for a
one-dimensional interval except that interval width would be
replaced with maximum distance across the region. The
required storage and partition adjustment would be the
primary difficulties.

Experimental results in Chapter IV indicate that for a
one-dimensional Gaussian example, the nonparametric procedure can require
over one hundred times as many observations as the optimum Gaussian
procedure to achieve equal performance. The reason is that the
nonparametric procedure does not utilize the a priori knowledge that
the density function is Gaussian. This can be an advantage when the
density function is not Gaussian; on the other hand it is desirable
to have provision for using a priori knowledge should it be available.

Gaussian approximations were used for the beta densities to
simplify the integration over a region V in the $(U_a, U_b)$ plane. For
small n a numerical or a Monte Carlo integration method could be used.
The latter method can be accomplished by generating ordered pairs of
observations from statistically independent beta distributions. The
first coordinate is generated according to $\beta^*(U_a \mid \gamma_{a1}, \gamma_{a2})$ and the
second according to $\beta^*(U_b \mid \gamma_{b1}, \gamma_{b2})$. The relative frequency with
which the result $(U_a, U_b)$ occurs in V is an estimate for Pr(V). The
coordinate $U_j$ is easily generated from $P_j$ [scaling relation illegible
in source], with $P_j$ generated according to $\beta(P_j \mid \gamma_{j1}, \gamma_{j2})$. $P_j$ is
taken as the $\gamma_{j1}$-th smallest outcome from a sequence of
$\gamma_{j1} + \gamma_{j2} - 1$ observations of a uniform distribution on the
interval [0,1]. This approach is impractical for large
$n_j = \gamma_{j1} + \gamma_{j2} - 1$ because of the large number
of uniformly distributed observations required.
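
The Monte Carlo estimate described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the number of trials, and the example region V (the half plane where the second coordinate exceeds the first) are hypothetical, and the scaling from $P_j$ to $U_j$ is omitted. The beta variates are produced from sorted uniform draws, using the standard fact that the k-th smallest of n independent uniform(0,1) observations follows a Beta(k, n + 1 - k) distribution.

```python
import random


def beta_order_statistic(g1, g2, rng):
    """Draw a Beta(g1, g2) variate for positive integers g1 and g2.

    The g1-th smallest of g1 + g2 - 1 independent uniform(0, 1)
    draws follows a Beta(g1, g2) distribution.
    """
    n = g1 + g2 - 1
    uniforms = sorted(rng.random() for _ in range(n))
    return uniforms[g1 - 1]


def estimate_pr_v(in_region, gamma_a, gamma_b, trials=20000, seed=0):
    """Relative frequency with which a generated pair falls in V."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        p_a = beta_order_statistic(gamma_a[0], gamma_a[1], rng)
        p_b = beta_order_statistic(gamma_b[0], gamma_b[1], rng)
        if in_region(p_a, p_b):
            hits += 1
    return hits / trials


# Hypothetical region V: second coordinate larger than the first.
estimate = estimate_pr_v(lambda a, b: b > a, (2, 3), (2, 3))
```

Each beta variate costs $\gamma_{j1} + \gamma_{j2} - 1$ uniform draws plus a sort, which is the cost that makes the approach impractical for large $n_j$.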

This report discusses the two class recognition problem.
Generalization of the procedure to multiple classes has not been
accomplished. One way to deal with the multiclass problem using the two
class procedure is to lump classes into two disjoint groups. The
group chosen can be split into two disjoint groups, etc., so that
finally the chosen group consists of just one class.
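
The grouping idea can be sketched as a repeated two-way split. The sketch below is hypothetical: `two_class_rule` stands in for any two class decision rule (it is assumed to return True when the observation is assigned to the first group), and halving the class list is only one of many possible ways to form the two disjoint groups.

```python
def classify_multiclass(x, classes, two_class_rule):
    """Assign x to one class by repeated two-group decisions.

    two_class_rule(x, group_1, group_2) is a hypothetical callable that
    returns True when x is assigned to group_1 and False otherwise.
    """
    group = list(classes)
    while len(group) > 1:
        mid = len(group) // 2
        first, second = group[:mid], group[mid:]
        # Keep only the chosen group and split it again.
        group = first if two_class_rule(x, first, second) else second
    return group[0]
```

With M classes split evenly at each stage, about log2(M) two class decisions are needed per observation.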
APPENDIX A


The object of this appendix is to discuss methods for specifying
the characterizing parameters in a beta d.f. on $P_j(i)$

$$f(P_j(i) \mid s_{j1}(i), s_{j2}(i)) = \frac{\Gamma(s_{j1}(i) + s_{j2}(i))}{\Gamma(s_{j1}(i))\,\Gamma(s_{j2}(i))}\, P_j(i)^{s_{j1}(i) - 1}\,(1 - P_j(i))^{s_{j2}(i) - 1} \tag{A.1}$$

such that a priori knowledge about $P_j(i)$ is accounted for. The d.f.
represents uncertainty in $P_j(i)$, the probability of an observation
from class $\omega_j$ falling in the ith interval of a given R-interval
partition of [0,1].

First, consider the case where no a priori knowledge is available.
The characterizing parameters, $s_{j1}(i)$ and $s_{j2}(i)$, can be chosen as

$$s_{j1}(i) = 1, \qquad s_{j2}(i) = R - 1 \tag{A.2}$$

s's so chosen are consistent with s's replacing γ's specified by
Equations (2.4), provided the $P_j(i)$'s are described jointly by
the R - 1 variate Dirichlet d.f. having 1's for parameters. Thus,
each allowed set of $P_j(i)$'s is equally likely, a condition sometimes
said to correspond to no a priori knowledge. Another method of
defining the s's without benefit of a priori knowledge on the
$P_j(i)$'s is to use the first R - 1 training observations to specify
the initial partition. $s_{j1}(i)$ is chosen as the ratio of the ith
interval width to the width of the smallest interval containing
the ith interval but having boundaries defined by observations
from class $\omega_j$. For these computations, observations from both
classes are assumed to exist at the end points 0 and 1. The
parameter $t_j$ is specified to be the number of class $\omega_j$ observations
from the first R - 1 training observations.

Then $s_{j2}(i)$ is given by

$$s_{j2}(i) = t_j + 1 - s_{j1}(i) \tag{A.3}$$

The maximum value for $s_{j1}(i)$ is one. In practice $s_{j1}(i)$ is guaranteed
positive by discarding, for interval forming purposes, training
observations that cause ties.

If a priori knowledge is available in the form of "a priori
training observations" [35] (fictitious observations that might
be obtained based on what is known about the $P_j(i)$'s), their
numbers can be added to the appropriate s's.

Now suppose that a priori knowledge consists of the expected
value $E_j(i)$ and variance $\mathrm{Var}_j(i)$ for each $P_j(i)$. Such knowledge
can be included by solving Equation (2.5) for $\gamma_{j1}(i)$ and $\gamma_{j2}(i)$
(in this case $s_{j1}(i)$ and $s_{j2}(i)$):

$$s_{j1}(i) = E_j(i)\left[\frac{E_j(i)\,(1 - E_j(i))}{\mathrm{Var}_j(i)} - 1\right]$$

$$s_{j2}(i) = (1 - E_j(i))\left[\frac{E_j(i)\,(1 - E_j(i))}{\mathrm{Var}_j(i)} - 1\right]$$

The resulting relations among $E_j(i)$, $\mathrm{Var}_j(i)$, $s_{j1}(i)$, and $t_j$ are
illustrated by Figure 45. $t_j$ is related to $s_{j1}(i)$ and $s_{j2}(i)$ by
(A.3). Requirements on the a priori values for $E_j(i)$ and $\mathrm{Var}_j(i)$
are

$$0 < E_j(i) < 1$$

$$\frac{E_j(i)\,(1 - E_j(i))}{\mathrm{Var}_j(i)} \ge 2$$

These requirements ensure that $t_j$ is non-negative ($t_j$ can be
interpreted as a number of a priori training observations), and
that $s_{j1}(i)$ and $s_{j2}(i)$ are positive (a requirement for parameters
characterizing any beta d.f.).
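
As a small illustration of matching a beta d.f. to an a priori mean and variance, the following sketch computes the two characterizing parameters and the implied $t_j$. The function name is hypothetical. The formulas are the standard moment-matching relations for a beta distribution, and the check that the ratio E(1 - E)/Var is at least 2 is derived here from (A.3) (since $t_j = s_{j1} + s_{j2} - 1$), so it should be read as an assumption about the intended requirement.

```python
def beta_parameters_from_moments(mean, variance):
    """Return (s1, s2, t) for a beta d.f. with the given mean and variance.

    s1 and s2 are the characterizing parameters; t = s1 + s2 - 1 follows
    from rearranging s2 = t + 1 - s1.
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    ratio = mean * (1.0 - mean) / variance
    if ratio < 2.0:
        # ratio - 2 equals t, so a smaller ratio would make t negative.
        raise ValueError("variance too large for a non-negative t")
    common = ratio - 1.0
    s1 = mean * common
    s2 = (1.0 - mean) * common
    return s1, s2, s1 + s2 - 1.0


s1, s2, t = beta_parameters_from_moments(0.5, 0.05)
```

For a mean of 0.5 and a variance of 0.05 the ratio is 5, so both parameters are close to 2 and $t_j$ is close to 3.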


APPENDIX B


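The quantity

$$T = \int_{\rho}^{A_b} \int_{0}^{r_0} \frac{1}{A_a}\,\beta\!\left(\frac{U_a}{A_a} \,\Big|\, \gamma_{a1}, \gamma_{a2}\right) \frac{1}{A_b}\,\beta\!\left(\frac{U_b}{A_b} \,\Big|\, \gamma_{b1}, \gamma_{b2}\right) dU_a\, dU_b \tag{B.1}$$

is evaluated in this appendix. In (B.1),

$$r_0 = \mathrm{Min}\,[\,U_b - q,\; A_a\,] \tag{B.2}$$

$A_a$, $A_b$, and $\rho$ are finite, positive, real constants, $q$ is a real
constant satisfying $q < \rho$, and $\gamma_{a1}$, $\gamma_{a2}$, $\gamma_{b1}$, and $\gamma_{b2}$ are finite
positive integer constants. $\beta$ is defined by:

$$\beta(z \mid \alpha_1, \alpha_2) = \begin{cases} Be^{-1}(\alpha_1, \alpha_2)\, z^{\alpha_1 - 1} (1 - z)^{\alpha_2 - 1} & \text{if } 0 < z < 1 \\ 0 & \text{otherwise} \end{cases} \tag{B.3}$$

where

$$Be(\alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)} \tag{B.4}$$

The limits of integration in (B.1) define the region of the $(U_a, U_b)$
plane illustrated in Figure 46.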
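Define:

$$I = \int_{0}^{r_0} \frac{1}{A_a}\,\beta\!\left(\frac{U_a}{A_a} \,\Big|\, \gamma_{a1}, \gamma_{a2}\right) dU_a \tag{B.5}$$

Let $x = U_a / A_a$. Then:

$$I = \int_{0}^{r_0/A_a} \beta(x \mid \gamma_{a1}, \gamma_{a2})\, dx = Be^{-1}(\gamma_{a1}, \gamma_{a2}) \int_{0}^{r_0/A_a} x^{\gamma_{a1} - 1} (1 - x)^{\gamma_{a2} - 1}\, dx$$

$$= Be^{-1}(\gamma_{a1}, \gamma_{a2}) \int_{0}^{r_0/A_a} x^{\gamma_{a1} - 1} \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} (-1)^j x^j\, dx$$

$$= Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} (-1)^j\, \frac{(r_0/A_a)^{\gamma_{a1} + j}}{\gamma_{a1} + j} \tag{B.6}$$

Substituting into (B.1) gives:

$$T = \int_{\rho}^{A_b} \frac{1}{A_b}\,\beta\!\left(\frac{U_b}{A_b} \,\Big|\, \gamma_{b1}, \gamma_{b2}\right) Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} (-1)^j\, \frac{(r_0/A_a)^{\gamma_{a1} + j}}{\gamma_{a1} + j}\; dU_b \tag{B.7}$$

T can be written as the sum:

$$T = T_1 + T_2 \tag{B.8}$$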
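
The finite-sum form of the inner integral can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, `z` stands for the upper limit $r_0/A_a$, and the comparison uses a simple midpoint rule. It assumes positive integer parameters, as the appendix does.

```python
from math import comb, factorial


def beta_fn(a, b):
    """Be(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for positive integers."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)


def inner_integral_series(z, g1, g2):
    """Integral of the beta density from 0 to z via the binomial expansion."""
    total = sum(
        comb(g2 - 1, j) * (-1) ** j * z ** (g1 + j) / (g1 + j)
        for j in range(g2)
    )
    return total / beta_fn(g1, g2)


def inner_integral_midpoint(z, g1, g2, steps=20000):
    """Same integral by the midpoint rule, for comparison."""
    h = z / steps
    area = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        area += x ** (g1 - 1) * (1.0 - x) ** (g2 - 1)
    return area * h / beta_fn(g1, g2)


series_value = inner_integral_series(0.4, 3, 4)
midpoint_value = inner_integral_midpoint(0.4, 3, 4)
```

The alternating signs in the sum cause cancellation, so floating-point evaluation loses accuracy as the parameters grow.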
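with $T_1$ and $T_2$ given by:

$$T_1 = \int_{\rho}^{y_0} \frac{1}{A_b}\,\beta\!\left(\frac{U_b}{A_b} \,\Big|\, \gamma_{b1}, \gamma_{b2}\right) Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} \frac{(-1)^j}{\gamma_{a1} + j} \left(\frac{U_b - q}{A_a}\right)^{\gamma_{a1} + j} dU_b$$

$$T_2 = \int_{y_0}^{A_b} \frac{1}{A_b}\,\beta\!\left(\frac{U_b}{A_b} \,\Big|\, \gamma_{b1}, \gamma_{b2}\right) dU_b \tag{B.9}$$

where

$$y_0 = \mathrm{Min}\,[\,\mathrm{Max}(A_a + q,\; \rho),\; A_b\,] \tag{B.10}$$

Let $y = U_b / A_b$. Then $T_1$ and $T_2$ become:

$$T_1 = Be^{-1}(\gamma_{b1}, \gamma_{b2})\, Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{k=0}^{\gamma_{b2} - 1} \binom{\gamma_{b2} - 1}{k} (-1)^k \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} \frac{(-1)^j}{\gamma_{a1} + j} \int_{\rho/A_b}^{y_0/A_b} y^{k + \gamma_{b1} - 1} \left(\frac{A_b y - q}{A_a}\right)^{\gamma_{a1} + j} dy$$

$$T_2 = Be^{-1}(\gamma_{b1}, \gamma_{b2}) \sum_{k=0}^{\gamma_{b2} - 1} \binom{\gamma_{b2} - 1}{k} (-1)^k \int_{y_0/A_b}^{1} y^{k + \gamma_{b1} - 1}\, dy \tag{B.11}$$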
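The expansion

$$(A_b y - q)^{\gamma_{a1} + j} = \sum_{v=0}^{\gamma_{a1} + j} \binom{\gamma_{a1} + j}{v} (A_b y)^v (-q)^{\gamma_{a1} + j - v} \tag{B.12}$$

can be used in (B.11), the integration performed, and $T_1$ and $T_2$
summed. The result is:

$$T = Be^{-1}(\gamma_{b1}, \gamma_{b2})\, Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} \frac{(-1)^j}{(\gamma_{a1} + j)\, A_a^{\gamma_{a1} + j}} \sum_{k=0}^{\gamma_{b2} - 1} \binom{\gamma_{b2} - 1}{k} (-1)^k \sum_{v=0}^{\gamma_{a1} + j} \binom{\gamma_{a1} + j}{v} A_b^v (-q)^{\gamma_{a1} + j - v}\, \frac{(y_0/A_b)^{k + \gamma_{b1} + v} - (\rho/A_b)^{k + \gamma_{b1} + v}}{k + \gamma_{b1} + v}$$

$$\quad + Be^{-1}(\gamma_{b1}, \gamma_{b2}) \sum_{k=0}^{\gamma_{b2} - 1} \binom{\gamma_{b2} - 1}{k} (-1)^k\, \frac{1 - (y_0/A_b)^{k + \gamma_{b1}}}{k + \gamma_{b1}} \tag{B.13}$$

where $y_0$ is obtained from (B.10).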
APPENDIX C


Evaluate the integral

[Equation (C.1), a double integral with respect to $U_a$ and $U_b$, is
illegible in the source.]

Let

[Equation (C.2), which defines the new variables $y_a$ and $y_b$, is
illegible in the source.]

Then

[Equation (C.3), a double integral with respect to $y_a$ and $y_b$, is
illegible in the source.]



APPENDIX  D 


This appendix presents a numerical procedure for the
minimization of A,

$$A = \Phi(-Q), \qquad \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy, \quad -\infty < x < \infty \tag{D.1}$$

with respect to $\lambda_a$. The constants $\mu_a$, $\sigma_a$, $C_a$, and $C_b$
are all positive. In addition, it is known that

[Conditions (D.2) are partly illegible in the source; the legible part
includes $\mu_a > C_a$.]

Because $\Phi(-Q)$ is strictly decreasing in Q, A is minimized by
maximizing Q. Q can be written in terms of $\lambda_a$ as:
[Equation (D.3), which expresses Q in terms of $(\lambda_a - C_a)$, is
illegible in the source.]

Set the derivative of Q with respect to $\lambda_a$ to zero.

[The derivative $dQ/d\lambda_a$ and its simplification are illegible in
the source.]

Collecting powers of $(\lambda_a - C_a)$ gives:

[Equation (D.4) is illegible in the source.]

(D.4) is a cubic in $(\lambda_a - C_a)$ and is difficult to solve analytically.
However, the desired root may be obtained by using the following
numerical technique. Define $g(\lambda_a)$ to be the left side of (D.4),
and rearrange to get
[Equation (D.5), the rearranged expression for $g(\lambda_a)$, is illegible
in the source.]

Note from (D.2) that

[Conditions (D.6) are partly illegible in the source; the legible parts
state that $g(C_a) < 0$ and that $g(\lambda_a) > 0$ for $\lambda_a > \mu_a$.]

Conditions (D.6), together with the fact that the inflection point
for g is at $\lambda_a = C_a$, guarantee a unique root of $g(\lambda_a)$ in the
interval $(C_a + [\text{term illegible}],\ \mu_a)$. Further, no root of $g(\lambda_a)$ exists
for $\lambda_a > \mu_a$ or for $\lambda_a$ in the interval $(C_a,\ C_a + [\text{term illegible}])$.
Physical considerations of the problem show that the root sought
corresponds to a relative minimum for A.

Let $\lambda_a(t)$ be the root obtained at the tth stage of a Newton's
iterative procedure. Then $\lambda_a(t + 1)$ is obtained from $\lambda_a(t)$ by:

$$\lambda_a(t + 1) = \lambda_a(t) - \frac{g(\lambda_a(t))}{g'(\lambda_a(t))} \tag{D.7}$$

The initial value is chosen as

$$\lambda_a(0) = [\text{illegible in source}] \tag{D.8}$$

The process is stopped when

$$|\lambda_a(t) - \lambda_a(t + 1)| < 0.001 \tag{D.9}$$

and the solution is taken to be:

$$\lambda_a = \lambda_a(t + 1) \tag{D.10}$$
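
The iteration and stopping rule can be sketched as follows. This is only an illustration: the cubic used here is a hypothetical stand-in, since the coefficients of the cubic in this appendix are not legible, and the starting value and iteration cap are likewise assumptions.

```python
def newton_root(g, g_prime, start, tol=0.001, max_iter=100):
    """Newton iteration, stopped when successive iterates differ by less than tol."""
    x = start
    for _ in range(max_iter):
        x_next = x - g(x) / g_prime(x)
        if abs(x - x_next) < tol:
            # The latest iterate is taken as the solution.
            return x_next
        x = x_next
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical cubic standing in for g; not the cubic of this appendix.
def g(x):
    return x ** 3 - 2.0 * x - 5.0


def g_prime(x):
    return 3.0 * x ** 2 - 2.0


root = newton_root(g, g_prime, start=2.0)
```

Newton's method converges quickly when the starting value lies near the root, which is why a well-chosen initial value matters.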
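APPENDIX E

In this appendix an experimental comparison is made of the
quantities T and A given by

[Equations (E.1) and (E.2) are illegible in the source.]

for some particular values of the parameters $\gamma_{a1}$, $\gamma_{a2}$, $\gamma_{b1}$, and
$\gamma_{b2}$. In (E.2) the values $\mu_a$, $\sigma_a$, $\mu_b$, and $\sigma_b$ are given in terms of
$\gamma_{a1}$, $\gamma_{a2}$, $\gamma_{b1}$, $\gamma_{b2}$, and A by

[Equation (E.3) is illegible in the source.]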
The function $\beta$ is defined by Equation (2.3). The region of
integration for T is the cross-hatched region of the $(U_a, U_b)$ plane
illustrated in Figure 47.


Figure 47. Region of Integration


For A, the region is the whole half plane above the line $U_b = U_a$.
Since the product $\beta(U_b/A \mid \gamma_{b1}, \gamma_{b2})\,\beta(U_a/A \mid \gamma_{a1}, \gamma_{a2})$ is zero for all
pairs $(U_a, U_b)$ outside the cross-hatched region and above the
line $U_b = U_a$, the region of integration for T may be considered
the same as that for A. Note that (E.2) is obtained from (E.1)
by replacing the beta d.f.'s with Gaussian d.f.'s having the
same means and variances. Figure 48 shows the beta d.f. $\beta(x \mid Np, Nq)$
for p = 1, q = 2, and three values of N: N = 1, 10, 100. It
illustrates the convergence of a sequence of beta d.f.'s to a
Gaussian d.f. as the parameters get large while maintaining
constant ratio. This fact is proven in References [36,37].
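
The convergence can be illustrated numerically by comparing each beta d.f. with the Gaussian d.f. having the same mean and variance. The sketch below is an assumption-laden illustration rather than the computation behind the figure: the function names are hypothetical, the comparison is made on cumulative distributions, and the beta cumulative distribution is accumulated with a midpoint rule.

```python
import math


def beta_pdf(x, a, b):
    """Beta density at 0 < x < 1, computed through logarithms for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1.0 - x))


def gaussian_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def max_cdf_gap(a, b, steps=20000):
    """Largest gap on [0, 1] between the beta and moment-matched Gaussian c.d.f.'s."""
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    h = 1.0 / steps
    beta_cdf = 0.0
    gap = gaussian_cdf(0.0, mean, std)  # the beta c.d.f. is zero at x = 0
    for i in range(steps):
        beta_cdf += beta_pdf((i + 0.5) * h, a, b) * h
        gap = max(gap, abs(beta_cdf - gaussian_cdf((i + 1) * h, mean, std)))
    return gap


# p = 1, q = 2 and N = 1, 10, 100, as in the text.
gaps = [max_cdf_gap(n * 1, n * 2) for n in (1, 10, 100)]
```

The gap shrinks as N grows, in line with the convergence stated above.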

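Replacing $\rho$ and $q$ with 0, $r_0$ with $U_b$, and $A_a$, $A_b$, and $y_0$
with A, the result given by (B.13) can be used to evaluate (E.1):

$$T = Be^{-1}(\gamma_{b1}, \gamma_{b2})\, Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} \frac{(-1)^j}{\gamma_{a1} + j} \sum_{k=0}^{\gamma_{b2} - 1} \binom{\gamma_{b2} - 1}{k} \frac{(-1)^k}{k + \gamma_{b1} + \gamma_{a1} + j}$$

or

$$T = Be^{-1}(\gamma_{b1}, \gamma_{b2})\, Be^{-1}(\gamma_{a1}, \gamma_{a2}) \sum_{j=0}^{\gamma_{a2} - 1} \binom{\gamma_{a2} - 1}{j} (-1)^j\, \frac{Be(\gamma_{b1} + \gamma_{a1} + j,\; \gamma_{b2})}{\gamma_{a1} + j} \tag{E.4}$$

where the identity

$$Be^{-1}(\gamma_1, \gamma_2) \sum_{k=0}^{\gamma_2 - 1} \binom{\gamma_2 - 1}{k} \frac{(-1)^k}{k + \gamma_1} = 1 \tag{E.5}$$

resulting from the beta density function integrating to one has
been used.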
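
A finite sum of this kind can be evaluated without rounding error by using rational arithmetic. The sketch below is hypothetical in its names and rests on one stated assumption: that T is the probability that the first scaled beta variable falls below the second, which is how the region of integration is described above. Under that assumption T equals the sum over j of $\binom{\gamma_{a2}-1}{j}(-1)^j\,Be(\gamma_{b1}+\gamma_{a1}+j, \gamma_{b2})/(\gamma_{a1}+j)$, divided by $Be(\gamma_{b1},\gamma_{b2})\,Be(\gamma_{a1},\gamma_{a2})$.

```python
from fractions import Fraction
from math import comb, factorial


def beta_fn(a, b):
    """Be(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for positive integers."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def t_exact(ga1, ga2, gb1, gb2):
    """Finite-sum form of T, evaluated exactly with rational arithmetic."""
    total = Fraction(0)
    for j in range(ga2):
        term = comb(ga2 - 1, j) * beta_fn(gb1 + ga1 + j, gb2) / (ga1 + j)
        total += term if j % 2 == 0 else -term
    return total / (beta_fn(gb1, gb2) * beta_fn(ga1, ga2))


# Example 1 parameters with N = 1: identical distributions, so T = 1/2.
t_example = t_exact(1, 9, 1, 9)
```

Exact fractions avoid the cancellation error of the alternating sum, at the cost of very large integers as the parameters grow.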

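From Appendix C, A is given by

[expression illegible in the source]

where

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy, \quad -\infty < x < \infty$$

Let

[Equation (E.6) is illegible in the source.]

From (E.3), Q becomes

$$Q = \frac{\dfrac{\gamma_{a1}}{\gamma_{a1} + \gamma_{a2}} - \dfrac{\gamma_{b1}}{\gamma_{b1} + \gamma_{b2}}}{\sqrt{\dfrac{\gamma_{a1}\gamma_{a2}}{(\gamma_{a1} + \gamma_{a2})^2 (\gamma_{a1} + \gamma_{a2} + 1)} + \dfrac{\gamma_{b1}\gamma_{b2}}{(\gamma_{b1} + \gamma_{b2})^2 (\gamma_{b1} + \gamma_{b2} + 1)}}} \tag{E.7}$$

So that $\gamma_{a1}/\gamma_{a2}$ and $\gamma_{b1}/\gamma_{b2}$ are constant, let $\gamma_{a1}$, $\gamma_{a2}$, $\gamma_{b1}$, and
$\gamma_{b2}$ be written in terms of the constants $p_a$, $q_a$, $p_b$, $q_b$ and the
variable N as follows:

$$\gamma_{a1} = N p_a, \quad \gamma_{a2} = N q_a, \quad \gamma_{b1} = N p_b, \quad \gamma_{b2} = N q_b \tag{E.8}$$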
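
The Gaussian approximation can be sketched as follows. The function name is hypothetical, and the sketch assumes that Q is the difference of the two beta means (first minus second) divided by the square root of the sum of the two beta variances, with $A = \Phi(-Q)$. The scale constant cancels in this ratio, so it is omitted.

```python
import math


def gaussian_approximation(p_a, q_a, p_b, q_b, n):
    """Return Phi(-Q) for beta parameters (n*p_a, n*q_a) and (n*p_b, n*q_b)."""

    def mean_and_variance(g1, g2):
        s = g1 + g2
        return g1 / s, g1 * g2 / (s * s * (s + 1))

    mean_a, var_a = mean_and_variance(n * p_a, n * q_a)
    mean_b, var_b = mean_and_variance(n * p_b, n * q_b)
    q = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Standard Gaussian cumulative distribution evaluated at -q.
    return 0.5 * (1.0 + math.erf(-q / math.sqrt(2.0)))


# Example 1 parameters: identical distributions give Q = 0 and A = 0.5.
values = [gaussian_approximation(1, 9, 1, 9, n) for n in range(1, 11)]
```

The cost of this computation does not depend on N, in contrast to the finite sum for T.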


T and $A = \Phi(-Q)$ are computed for three different examples.

Example 1

$p_a = 1$, $q_a = 9$
$p_b = 1$, $q_b = 9$
N = 1, 2, ..., 10

Example 2

$p_a = 2$, $q_a = 8$
$p_b$ = [illegible in source], $q_b = 9$
N = 1, 2, ..., 10

Example 3

$p_a$ = [illegible in source], $q_a = 7$
$p_b = 1$, $q_b = 9$
N = 1, 2, ..., 10

The locations of the means in the $(U_a, U_b)$ plane of the distributions
for these three examples are shown in Figure 49.


Figure 49. Distribution Means for Examples
The resulting T and A are plotted in Figure 50 as functions of
N.  The  circles  represent  values  for  T while  the  triangles  represent 
values  for  A.  These  computations  were  made  with  Purdue  University's 
CDC  6500  digital  computer  using  single  precision.  No  special 
computational  tricks  were  employed  other  than  the  performing  of 
all  possible  factorial  cancellation  in  the  expression  for  T. 

Several  pertinent  facts  are  worth  noting. 

1)  For  these  examples,  close  agreement  between  T and  A 
for  all  except  the  higher  values  of  N is  observed. 

2)  Lack  of  agreement  between  T and  A for  large  N is  due 
to  computer  inaccuracies  resulting  from  accumulated 
error  in  the  many  arithmetic  operations  necessary  to 
compute  T. 

3)  Computational  time  for  T is  approximately  1^  minutes  for 
each  of  the  three  examples.  Computational  time  for  A 

is  relatively  negligible. 

4)  For T, accuracy decreases and computational time increases
as  N increases.  For  A,  no  change  in  computational 
accuracy  and  time  required  occurs  for  increasing  N. 


